Re: [PATCH] BPF: Disable on PREEMPT_RT

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 18 Oct 2019 16:05:43 -0700

On Fri, Oct 18, 2019 at 01:28:21PM +0200, Thomas Gleixner wrote:
> 
> On a non RT enabled build these primitives just resolve to
> preempt_disable(), local_bh_disable(), local_irq_disable() and
> local_irq_save().
> 
> On RT the local lock is actually a per CPU lock which allows nesting. i.e.
> 
>       preempt_disable();
>       ...
>       local_irq_disable();
> 
> becomes
> 
> 	local_lock(this_scope);
> 	...
> 	local_lock_irq(this_scope);
> 
> The local lock is a 'sleeping' spinlock on RT (PI support) and as any other
> RT substituted lock it also ensures that the task cannot be migrated when
> it is held, which makes per cpu assumptions work - the kernel has lots of
> them. :)
> 
> That works as long as the scope is well defined and clear. It does not work
> when preempt_disable() or any of the other scopeless protections is used to
> protect random (unidentifiable) code against each other, which means the
> protection has the dreaded per CPU BKL semantics, i.e. undefined.
> 
> One nice thing about local_lock even aside of RT is that it annotates the
> code in terms of protection scope which actually gives you also lockdep
> coverage. We found already a few bugs that way in the past, where data was
> protected with preempt_disable() when the code was introduced and later
> access from interrupt code was added without anyone noticing for years....

The concept on local_lock() makes sense to me.
The magic macro you're proposing that will convert it to old school
preempt_disable() on !RT should hopefully make the changes across
net and bpf land mostly mechanical.
One thing to clarify:
when networking core interacts with bpf we know that bh doesn't migrate,
so per-cpu datastructres that bpf side populates are accessed later
by networking core when program finishes.
Similar thing happens between tracing bits and bpf progs.
It feels to me that local_lock() approach should work here as well.
napi processing bits will grab it. Later bpf will grab potentially
the same lock again.
The weird bit that such lock will have numbe_of_lockers >= 1
for long periods of time. At least until napi runners won't see
any more incoming packets. I'm not sure yet where such local_lock
will be placed in the napi code (may be in drivers too for xdp).
Does this make sense from RT perspective?

> > BPF also doesn't have unbound runtime.
> > So two above issues are actually non-issues.
> 
> That'd be nice :)
> 
> Anyway, we'll have a look whether this can be solved with local locks which
> would be nice, but that still does not solve the issue with the non_owner
> release of the rwsem.

Sure. We can discuss this separately.
up_read_non_owner is used only by build_id mode of stack collectors.
We can disable it for RT for long time. We're using stack with build_id heavily
and have no plans to use RT. I believe datacenters, in general, are not going
to use RT for foreseeable future, so a trade off between stack with build_id
vs RT, I think, is acceptable.

But reading your other replies the gradual approach we're discussing here
doesn't sound acceptable ? And you guys insist on disabling bpf under RT
just to merge some out of tree code ? I find this rude and not acceptable.

If RT wants to get merged it should be disabled when BPF is on
and not the other way around.
Something like this:

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index deff97217496..b3dbc5f9a6de 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,7 +57,7 @@ config PREEMPT

 config PREEMPT_RT
        bool "Fully Preemptible Kernel (Real-Time)"
-       depends on EXPERT && ARCH_SUPPORTS_RT
+       depends on EXPERT && ARCH_SUPPORTS_RT && !BPF_SYSCALL
        select PREEMPTION
        help
          This option turns the kernel into a real-time kernel by replacing