Thanks for your reply. You are right that the problem I reported already
exists under PREEMPT_FULL, so PREEMPT_LAZY does not introduce a new issue.

My main concern is that if PREEMPT_LAZY is intended to become the default
model (please correct me if I am wrong) before this problem is addressed
in the BPF subsystem, it would be a big regression for us, especially if
distros pick up the change in the intervening period. I wanted to draw
attention to the issue so that this situation does not arise.

On Tue, Dec 10, 2024 at 3:14 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Dec 10, 2024 at 02:25:20PM +0100, Usama Saqib wrote:
> > [ Adding x86 / scheduler folks to Cc given PREEMPT_LAZY as-is would cause
> >   serious regressions for us. ]
> >
> > On 11/18/24 10:14 AM, Usama Saqib wrote:
> > > Hello,
> > >
> > > I hope everyone is doing well. It seems that work has started to
> > > introduce a new preemption model in the linux kernel PREEMPT_LAZY [1].
> > > According to the mailing list, the maintainers intend for this to
> > > replace PREEMPT_NONE and PREEMPT_VOLUNTARY as the default preemption
> > > model.
> > >
> > > From the changeset, it looks like PREEMPT_LAZY allows
> > > irqentry_exit_cond_resched() to get called on IRQ exit. This change,
> > > similar to PREEMPT_FULL, can get two bpf programs attached to a kprobe
> > > or tracepoint running in user context, to nest. This currently causes
> > > the nesting program to miss. I have been able to get these misses to
> > > happen on top of this new patch.
> > >
> > > This behavior is currently not possible with the default preemption
> > > model used in most distributions, PREEMPT_VOLUNTARY. For many products
> > > using BPF for tracing/security, this would constitute a regression in
> > > terms of reliability.
> > >
> > > My question is whether there is any ongoing work to fix this behavior
> > > of kprobes and tracepoints, so they do not miss on nesting. I have
> > > previously been told that there is ongoing work related to
> > > bpf-specific spinlocks to resolve this problem [2]. Will that be
> > > available by the time this is merged into the mainline, and the
> > > current defaults deprecated?
>
> I have no idea about the whole BPF thing, but if behaviour is as
> PREEMPT_FULL, then there is nothing to fix from a scheduler PoV.
>
> Note that most distros already build with PREEMPT_DYNAMIC, which allows
> users/admins to dynamically select the preemption model (either at boot
> or at runtime through debugfs).
>
> If certain BPF stuff cannot deal with full preemption, then I would have
> to call it broken.
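
On the PREEMPT_DYNAMIC point: for anyone who wants to check which model a
given machine is actually running, here is a minimal sketch. It assumes
debugfs is mounted at /sys/kernel/debug, that the file is sched/preempt,
and that the active model is the one shown in parentheses (for example
"none voluntary (full)"). The helper names are mine and purely
illustrative; please treat this as a rough aid, not a reference.

```python
from pathlib import Path

# Assumed location on PREEMPT_DYNAMIC kernels; reading it typically needs
# root and a mounted debugfs. Adjust if your kernel exposes it elsewhere.
PREEMPT_PATH = Path("/sys/kernel/debug/sched/preempt")


def current_preempt_model(text):
    """Return the model wrapped in parentheses, or None if none is marked.

    Assumes the file lists all models on one line and marks the active
    one as "(name)".
    """
    for token in text.split():
        if token.startswith("(") and token.endswith(")"):
            return token[1:-1]
    return None


def read_preempt_model(path=PREEMPT_PATH):
    """Read the debugfs file; return None if it is missing or unreadable."""
    try:
        return current_preempt_model(path.read_text())
    except OSError:
        # Not a PREEMPT_DYNAMIC kernel, debugfs not mounted, or no permission.
        return None
```

Calling read_preempt_model() as root prints nothing by itself; it returns
the active model name, or None when the file is not available.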