On Fri, Oct 18, 2019 at 01:28:21PM +0200, Thomas Gleixner wrote:
>
> On a non RT enabled build these primitives just resolve to
> preempt_disable(), local_bh_disable(), local_irq_disable() and
> local_irq_save().
>
> On RT the local lock is actually a per CPU lock which allows nesting. i.e.
>
>    preempt_disable();
>    ...
>    local_irq_disable();
>
> becomes
>
>    local_lock(this_scope);
>    ...
>    local_lock_irq(this_scope);
>
> The local lock is a 'sleeping' spinlock on RT (PI support) and as any other
> RT substituted lock it also ensures that the task cannot be migrated when
> it is held, which makes per cpu assumptions work - the kernel has lots of
> them. :)
>
> That works as long as the scope is well defined and clear. It does not work
> when preempt_disable() or any of the other scopeless protections is used to
> protect random (unidentifiable) code against each other, which means the
> protection has the dreaded per CPU BKL semantics, i.e. undefined.
>
> One nice thing about local_lock even aside of RT is that it annotates the
> code in terms of protection scope which actually gives you also lockdep
> coverage. We found already a few bugs that way in the past, where data was
> protected with preempt_disable() when the code was introduced and later
> access from interrupt code was added without anyone noticing for years....

The concept of local_lock() makes sense to me. The magic macro you're
proposing that will convert it to old school preempt_disable() on !RT
should hopefully make the changes across net and bpf land mostly
mechanical.

One thing to clarify: when the networking core interacts with bpf we
know that bh doesn't migrate, so per-cpu data structures that the bpf
side populates are accessed later by the networking core when the
program finishes. A similar thing happens between the tracing bits and
bpf progs. It feels to me that the local_lock() approach should work
here as well. The napi processing bits will grab it. Later bpf will
grab, potentially, the same lock again. The weird bit is that such a
lock will have number_of_lockers >= 1 for long periods of time, at
least until the napi runners see no more incoming packets. I'm not sure
yet where such a local_lock would be placed in the napi code (maybe in
drivers too, for xdp). Does this make sense from the RT perspective?
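To make the scoping concrete, a rough sketch of what I have in mind,
following the local_lock() calls from your example above. The lock name
and the DEFINE_LOCAL_LOCK() declaration are made up for illustration
(the exact API is whatever the magic macro ends up being), and where
the lock would actually live in the napi code is exactly the open
question:

    /* Hypothetical per-CPU scope shared by the napi path and bpf progs. */
    static DEFINE_LOCAL_LOCK(bpf_run_lock);

    /* napi side: the lock is held across the whole poll run. */
    static void napi_poll_side(void)
    {
            local_lock(bpf_run_lock);
            /* ... receive packets, run xdp/bpf programs ... */
            local_unlock(bpf_run_lock);
    }

    /*
     * bpf side: may take the same per-CPU lock again while the napi
     * side holds it. On RT this is the nesting per CPU lock described
     * above; on !RT it resolves to preempt_disable()/preempt_enable().
     */
    static void bpf_prog_side(void)
    {
            local_lock(bpf_run_lock);
            /* ... populate per-cpu data structures for the core to read ... */
            local_unlock(bpf_run_lock);
    }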
> > BPF also doesn't have unbound runtime.
> > So two above issues are actually non-issues.
>
> That'd be nice :)
>
> Anyway, we'll have a look whether this can be solved with local locks which
> would be nice, but that still does not solve the issue with the non_owner
> release of the rwsem.

Sure. We can discuss this separately. up_read_non_owner is used only by
the build_id mode of the stack collectors. We can disable it for RT for
a long time. We're using stacks with build_id heavily and have no plans
to use RT. I believe datacenters, in general, are not going to use RT
for the foreseeable future, so a trade-off between stacks with build_id
and RT is, I think, acceptable.

But reading your other replies, the gradual approach we're discussing
here doesn't sound acceptable? And you guys insist on disabling bpf
under RT just to merge some out of tree code? I find this rude and not
acceptable. If RT wants to get merged it should be disabled when BPF is
on and not the other way around.

Something like this:

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index deff97217496..b3dbc5f9a6de 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,7 +57,7 @@ config PREEMPT
 
 config PREEMPT_RT
 	bool "Fully Preemptible Kernel (Real-Time)"
-	depends on EXPERT && ARCH_SUPPORTS_RT
+	depends on EXPERT && ARCH_SUPPORTS_RT && !BPF_SYSCALL
 	select PREEMPTION
 	help
 	  This option turns the kernel into a real-time kernel by replacing