Re: BPF and lazy preemption.

Usama Saqib <usama.saqib@xxxxxxxxxxxxx> · Tue, 10 Dec 2024 15:48:32 +0100

Thanks for your reply. It is correct that the problem I shared is
already present under PREEMPT_FULL, and as such there is no new issue
being introduced by PREEMPT_LAZY.

My main concern is that if PREEMPT_LAZY is intended to become the
default mode (please correct me if I am wrong here) before this
problem is addressed in the BPF subsystem, then this would result in a
big regression for us. This is especially true if distros pick up the
changes in the intervening period. I wanted to draw attention to this
issue so this situation does not happen.
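
For anyone new to the thread, here is a rough user-space model of how I
understand the miss described in the quoted report below. It is only a
sketch, not kernel code: struct cpu_state, prog_enter(), prog_exit() and
simulate_preempted_nesting() are hypothetical names, and the assumption
is that a per-CPU "program active" counter causes a second invocation on
the same CPU to be dropped and counted as a miss.

```c
/*
 * Hypothetical user-space model of a per-CPU recursion guard.
 * None of these names are kernel symbols; they only illustrate
 * why a nested invocation on the same CPU is dropped.
 */
struct cpu_state {
	int prog_active;	/* models a per-CPU "program running" counter */
	unsigned long runs;	/* invocations that were allowed to run */
	unsigned long misses;	/* invocations dropped by the guard */
};

/* Returns 1 if the program body may run, 0 if the invocation is dropped. */
static int prog_enter(struct cpu_state *cpu)
{
	if (++cpu->prog_active != 1) {
		/* Another program is already active on this CPU: miss. */
		cpu->prog_active--;
		cpu->misses++;
		return 0;
	}
	cpu->runs++;
	return 1;
}

static void prog_exit(struct cpu_state *cpu)
{
	cpu->prog_active--;
}

/*
 * Program A starts in task 1. Before it finishes, task 1 is preempted
 * (for example on IRQ exit) and task 2, now on the same CPU, hits a
 * probe that wants to run program B.
 */
static void simulate_preempted_nesting(struct cpu_state *cpu)
{
	if (prog_enter(cpu)) {		/* program A, task 1 */
		/* assumed preemption point: task 2 runs here */
		if (prog_enter(cpu))	/* program B, task 2: dropped */
			prog_exit(cpu);
		prog_exit(cpu);		/* task 1 resumes and finishes A */
	}
}
```

In this model, one call to simulate_preempted_nesting() on a zeroed
struct cpu_state leaves runs == 1 and misses == 1. Without a preemption
point inside the first program, the second prog_enter() would only happen
after prog_exit() and nothing would be dropped.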

On Tue, Dec 10, 2024 at 3:14 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Dec 10, 2024 at 02:25:20PM +0100, Usama Saqib wrote:
> > [ Adding x86 / scheduler folks to Cc given PREEMPT_LAZY as-is would cause
> >   serious regressions for us. ]
> >
> > On 11/18/24 10:14 AM, Usama Saqib wrote:
> > > Hello,
> > >
> > > I hope everyone is doing well. It seems that work has started to
> > > introduce a new preemption model in the Linux kernel, PREEMPT_LAZY [1].
> > > According to the mailing list, the maintainers intend for this to
> > > replace PREEMPT_NONE and PREEMPT_VOLUNTARY as the default preemption
> > > model.
> > >
> > > From the changeset, it looks like PREEMPT_LAZY allows
> > > irqentry_exit_cond_resched() to get called on IRQ exit. This change,
> > > similar to PREEMPT_FULL, can cause two BPF programs attached to a
> > > kprobe or tracepoint and running in user context to nest. This
> > > currently causes the nesting program to miss. I have been able to get
> > > these misses to happen on top of this new patch.
> > >
> > > This behavior is currently not possible with the default preemption
> > > model used in most distributions, PREEMPT_VOLUNTARY. For many products
> > > using BPF for tracing/security, this would constitute a regression in
> > > terms of reliability.
> > >
> > > My question is whether there is any ongoing work to fix this behavior
> > > of kprobes and tracepoints, so they do not miss on nesting. I have
> > > previously been told that there is ongoing work related to
> > > BPF-specific spinlocks to resolve this problem [2]. Will that be
> > > available by the time this is merged into the mainline, and the
> > > current defaults deprecated?
>
> I have no idea about the whole BPF thing, but if behaviour is as
> PREEMPT_FULL, then there is nothing to fix from a scheduler PoV.
>
> Note that most distros already build with PREEMPT_DYNAMIC, which allows
> users/admins to dynamically select the preemption model (either at boot
> or at runtime through debugfs).
>
> If certain BPF stuff cannot deal with full preemption, then I would have
> to call it broken.
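
For reference, a small sketch of how one might check which model a
PREEMPT_DYNAMIC kernel is currently using. The assumptions here are mine:
that debugfs is mounted at /sys/kernel/debug, that the file is
sched/preempt, and that it lists the available models with the active one
in parentheses, e.g. "none voluntary (full)". The helper names
current_preempt_model() and read_preempt_model() are made up for this
example.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the parenthesised token from a line such as
 * "none voluntary (full)" into out.
 * Returns 0 on success, -1 if there is no complete "(...)" token
 * or it does not fit in out.
 */
static int current_preempt_model(const char *line, char *out, size_t outsz)
{
	const char *open = strchr(line, '(');
	const char *close = open ? strchr(open, ')') : NULL;
	size_t len;

	if (!open || !close)
		return -1;

	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outsz)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/*
 * Read the first line of path (assumed: /sys/kernel/debug/sched/preempt)
 * and extract the active model. Returns 0 on success, -1 on any error.
 */
static int read_preempt_model(const char *path, char *out, size_t outsz)
{
	char line[128];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return current_preempt_model(line, out, outsz);
}
```

Calling read_preempt_model("/sys/kernel/debug/sched/preempt", buf,
sizeof(buf)) typically needs root, and it returns -1 when the file is
missing, which is what I would expect on a kernel built without
PREEMPT_DYNAMIC.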