+acme

On Thu, 17 Oct 2019 23:54:07 +0200 (CEST)
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> On Thu, 17 Oct 2019, David Miller wrote:
>
> > From: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> > Date: Thu, 17 Oct 2019 17:40:21 +0200
> >
> > > On 2019-10-17 16:53:58 [+0200], Daniel Borkmann wrote:
> > >> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
> > >> > Disable BPF on PREEMPT_RT because
> > >> > - it allocates and frees memory in atomic context
> > >> > - it uses up_read_non_owner()
> > >> > - BPF_PROG_RUN() expects to be invoked in non-preemptible context
> > >>
> > >> For the latter you'd also need to disable seccomp-BPF and everything
> > >> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...
> > >
> > > I looked at tracing and it depended on BPF_SYSCALL so I assumed they all
> > > do… Now looking for BPF_PROG_RUN() there is PPP_FILTER,
> > > NET_TEAM_MODE_LOADBALANCE and probably more. I didn't find a symbol for
> > > seccomp-BPF.
> > > Would it make sense to override BPF_PROG_RUN() and make each caller fail
> > > instead? Other recommendations?
> >
> > I hope you understand that basically you are disabling any packet sniffing
> > on the system with this patch you are proposing.
> >
> > This means no tcpdump, not wireshark, etc. They will all become
> > non-functional.
> >
> > Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is
> > absolutely essential functionality for a Linux system at this point.
>
> I'm all ears for an alternative solution. Here are the pain points:
>
> #1) BPF disables preemption unconditionally with no way to do a proper RT
>     substitution like most other infrastructure in the kernel provides
>     via spinlocks or other locking primitives.

As I understand it, BPF programs cannot loop and are limited to 4096
instructions. Has anyone done any timing to see just how much having
preemption off while a BPF program executes is going to affect us? Are
we talking 1us or 50us? or longer? I wonder if there's some
instrumentation we could use to determine the maximum time spent
running a BPF program. Maybe some perf mojo...

>
> #2) BPF does allocations in atomic contexts, which is a dubious decision
>     even for non RT. That's related to #1

I guess my question here is, are the allocations done on behalf of an
about-to-run BPF program, or as a result of executing BPF code? Is it
something we might be able to satisfy from a pre-allocated pool rather
than kmalloc()? Ok, I need to go dive into BPF a bit deeper.

>
> #3) BPF uses the up_read_non_owner() hackery which was only invented to
>     deal with already existing horrors and not meant to be proliferated.
>
>     Yes, I know it's a existing facility ....

I'm sure I'll regret asking this, but why is up_read_non_owner() a
horror? I mean, I get the fundamental wrongness of having someone
that's not the owner of a semaphore performing an 'up' on it, but is
there an RT-specific reason that it's bad? Is it totally a blocker for
using BPF with RT or is it something we should fix over time?

>
> TBH, I have no idea how to deal with those things. So the only way forward
> for RT right now is to disable the whole thing.
>
> Clark might have some insight from the product side for you how much that
> impacts usability.
>
> Thanks,
>
> 	tglx

Clark is only just starting his journey with BPF, so not an expert.

I do think that we (RT) are going to have to co-exist with BPF, if only
due to the increased use of XDP. I also think that other sub-systems
will start to employ BPF for production purposes (as opposed to
debug/analysis which is how we generally look at tracing, packet
sniffing, etc.). I think we *have* to figure out how to co-exist.

Guess my "hey, that look interesting, think I'll leisurely read up on
it" just got a little less leisurely. I'm out most of the day tomorrow
but I'll catch up on email over the weekend.

Clark

--
The United States Coast Guard
Ruining Natural Selection since 1790
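
For reference on the pre-allocated pool question under #2: a minimal userspace sketch of the general idea, assuming fixed-size objects and a single caller. Every name in it (obj_pool, pool_init, pool_get, pool_put, POOL_SLOTS, SLOT_SIZE) is hypothetical and not a kernel or BPF interface, and it says nothing about how BPF actually allocates today. The point is only that the setup cost is paid once in a context where that is acceptable, and the hot path then never calls an allocator.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical fixed-size object pool (illustration only, not kernel code).
 * No locking here: a real version would need per-CPU pools or a lock
 * that is safe in the calling context.
 */
#define POOL_SLOTS 4
#define SLOT_SIZE  64

struct pool_slot {
	union {
		struct pool_slot *next;           /* valid only while the slot is free */
		unsigned char payload[SLOT_SIZE]; /* valid only while the slot is in use */
	} u;
};

struct obj_pool {
	struct pool_slot slots[POOL_SLOTS];
	struct pool_slot *free_list;
	size_t in_use;
};

/* Slow path: runs once, up front, and threads every slot onto the free list. */
static void pool_init(struct obj_pool *p)
{
	p->free_list = NULL;
	p->in_use = 0;
	for (size_t i = 0; i < POOL_SLOTS; i++) {
		p->slots[i].u.next = p->free_list;
		p->free_list = &p->slots[i];
	}
}

/* Fast path: O(1), no allocator call; returns NULL when the pool is empty. */
static void *pool_get(struct obj_pool *p)
{
	struct pool_slot *s = p->free_list;

	if (!s)
		return NULL;
	p->free_list = s->u.next;
	p->in_use++;
	return s->u.payload;
}

/* Fast path: O(1); pushes the slot back onto the free list. */
static void pool_put(struct obj_pool *p, void *obj)
{
	struct pool_slot *s = obj; /* payload sits at offset 0 of the slot */

	assert(s >= p->slots && s < p->slots + POOL_SLOTS);
	s->u.next = p->free_list;
	p->free_list = s;
	p->in_use--;
}
```

The trade-off is the usual one: the pool has to be sized for the worst case ahead of time, and the caller has to cope with pool_get() returning NULL instead of falling back to an allocation.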