Re: Are BPF programs preemptible?

On Wed, Jan 25, 2023 at 8:52 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> >
> > On Wed, Jan 25, 2023 at 2:04 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 24, 2023 at 7:24 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Jan 24, 2023 at 2:30 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 23, 2023 at 11:25 PM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > > > > On Mon, Jan 23, 2023 at 10:06 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > > > > >> >>> is there a way to do that?
> > > > > > > > > >> Maybe a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > > > > >> same cpu.
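
A minimal sketch of such a per-cpu re-entry counter (assuming a recent
clang and kernel with BPF atomic instructions; the map and function
names are illustrative, not existing code):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } reentry_cnt SEC(".maps");

    /* call on prog entry; a non-zero return means this prog has
     * interrupted another bpf prog mid-run on this cpu */
    static __always_inline __u64 prog_enter(void)
    {
            __u32 zero = 0;
            __u64 *cnt = bpf_map_lookup_elem(&reentry_cnt, &zero);

            if (!cnt)
                    return 0;
            return __sync_fetch_and_add(cnt, 1);
    }

    /* call on prog exit to balance the counter */
    static __always_inline void prog_exit(void)
    {
            __u32 zero = 0;
            __u64 *cnt = bpf_map_lookup_elem(&reentry_cnt, &zero);

            if (cnt)
                    __sync_fetch_and_add(cnt, -1);
    }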
> > > > > > > > > >
> > > > > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > > > > program data on a percpu map and I see that the counter is greater than
> > > > > > > > > > zero, should I ignore the event?
> > > > > > > > >
> > > > > > > > > map_update with BPF_F_LOCK disables preemption, if you're after updating
> > > > > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > > > > Perhaps that's needed now too.
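
A rough sketch of the BPF_F_LOCK pattern on a non-percpu map (assuming
libbpf with BTF-defined maps; names are illustrative, and note that at
the time of writing bpf_spin_lock is not allowed in tracing progs, so
a networking prog type is used here):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct map_val {
            struct bpf_spin_lock lock;
            __u64 counter;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u32);
            __type(value, struct map_val);
    } locked_map SEC(".maps");

    SEC("cgroup_skb/ingress")
    int locked_update(struct __sk_buff *skb)
    {
            __u32 key = 0;
            struct map_val *val = bpf_map_lookup_elem(&locked_map, &key);

            if (val) {
                    /* update the element under the map value's spin lock */
                    bpf_spin_lock(&val->lock);
                    val->counter++;
                    bpf_spin_unlock(&val->lock);
            }
            return 1;
    }

    char LICENSE[] SEC("license") = "GPL";

User space can then read a consistent copy of the element with
bpf_map_lookup_elem_flags(map_fd, &key, &copy, BPF_F_LOCK).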
> > > > > > > >
> > > > > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > > > > from the bpf program - maybe even adding a helper for that?
> > > > > > >
> > > > > > > I'm not sure what the issue is here.
> > > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > > > > be interrupted by another bpf prog.
> > > > > > > For example, a networking bpf prog running under the old preempt_disable
> > > > > > > can call into something that has a kprobe attached, and another tracing
> > > > > > > bpf prog will be invoked.
> > > > > > > The same can happen after we switched to migrate_disable.
> > > > > >
> > > > > > One difference here is that in the case you describe, the programmer can
> > > > > > know in advance which functions might call others and avoid that, or use
> > > > > > separate percpu maps. But if preemption can happen between functions that
> > > > > > are unrelated to one another (that have no caller/callee relationship),
> > > > > > then the programmer has no control over it.
> > > > >
> > > > > Could you give a specific example?
> > > >
> > > > Sure. I can give two examples from places where we saw such corruption:
> > > >
> > > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> > > > and a percpu scratch map to prepare the event for submit. When we also
> > > > added a tracepoint to trace sched_process_free (where we also use this
> > > > scratch percpu map), the security_file_open events got corrupted and
> > > > we didn't know what was happening (it was very hard to debug).
> > >
> > > bpf_lsm_file_open is sleepable.
> > > We never did preempt_disable there.
> > > kprobe on security_file_open works, but it's a bit of a hack.
> > > With preempt->migrate
> > > the delayed_put_task_struct->trace_sched_process_free can happen
> > > on the same cpu if the prog gets preempted in a preemptible kernel, but something
> > > is fishy here.
> > > Non-sleepable bpf progs always run in rcu cs while
> > > delayed_put_task_struct is call_rcu(),
> > > so if you've used a per-cpu scratch buffer for the duration of a
> > > bpf prog (either old preempt_disable or new migrate_disable)
> > > the trace_sched_process_free won't be called until prog is finished.
> > >
> > > If you're using a per-cpu scratch buffer to remember data in one
> > > execution of kprobe-bpf and consume it in the next, then all bets are off.
> > > This is going to break sooner or later.
> > >
> >
> > Yes, I agree this is fishy, but we only use the per-cpu scratch map
> > for the duration of a bpf program, assuming tail-calls are considered
> > to be part of the same program.
> > We've even pinpointed the place where preemption happens: right after
> > the call to the bpf_perf_event_output() helper, found by placing
> > bpf_printk calls in the security_file_open kprobe. Before the call to
> > the helper the buffer was OK, but after it the buffer was corrupted
> > (and there were more lines in the program after the call to this
> > helper).
>
> I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise,
> which will send an IPI to the current cpu, which may resched.
>
> I'm still missing why kprobe in security_file_open has to share
> per-cpu scratch buffer with kprobe in trace_sched_process_free.
> Why not use a scratch buffer per program?
>

Well, since all the documentation I read about eBPF says that BPF
programs are non-preemptible, I assumed it would be OK to use the same
per-cpu scratch buffer for all programs, as long as each use stays
within the scope of a single program. Now I understand this assumption
(and that documentation) was wrong.
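
To make the failure mode concrete, the shared-scratch pattern we used
looks roughly like this (simplified; names are illustrative, not the
actual Tracee source):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct scratch {
            __u8 buf[32768];        /* the 32kb event staging area */
    };

    /* one per-cpu buffer shared by ALL programs */
    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, struct scratch);
    } scratch_map SEC(".maps");

    SEC("kprobe/security_file_open")
    int trace_file_open(struct pt_regs *ctx)
    {
            __u32 zero = 0;
            struct scratch *s = bpf_map_lookup_elem(&scratch_map, &zero);

            if (!s)
                    return 0;
            /* fill s->buf with the event; if this prog is preempted
             * here, the tracepoint prog below can run on the same
             * cpu and clobber the half-built event */
            return 0;
    }

    SEC("tracepoint/sched/sched_process_free")
    int trace_process_free(void *ctx)
    {
            __u32 zero = 0;
            struct scratch *s = bpf_map_lookup_elem(&scratch_map, &zero);

            if (!s)
                    return 0;
            /* writes to the very same per-cpu s->buf */
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";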

> >
> > > > 2. The same was happening when kprobes were combined with cgroup_skb
> > > > programs to trace network events.
> > >
> > > irqs can interrupt old preempt_disable, so depending on where the
> > > kprobe-bpf is, it can run while the cgroup-bpf hasn't finished.
> > >
> >
> > Actually what we saw here is that the ingress skb program that runs
> > from ksoftirq/idle context corrupts a percpu map shared with a
> > raw_sys_exit tracepoint used to submit execve/openat events.
> >
> > > Then there are differences in kernel configs: no preempt, full preempt,
> > > preempt_rcu and differences in rcu_tasks_trace implementation
> > > over the years. You can only assume the validity of the pointer to a
> > > bpf per-cpu array or bpf per-cpu hash element, but you can make no
> > > assumption about concurrent access to the same map if you share that
> > > map across different programs.
> >
> > I think that is the main point here. Thanks for clarifying this.
> >
> > > See BPF_F_LOCK earlier suggestion. It might be what you need.
> >
> > I am not sure exactly how this can help. The percpu buffer we use is
> > 32kb in size, and we don't use the map_update helper; we use the
> > pointer returned from the lookup to update the map in several places
> > in the program.
>
> It wasn't clear what you do with the scratch buffer.
> BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without
> a race with bpf prog that would use bpf_spin_lock() to access the element.
> Looks like that's not the case you had in mind.
> You just stream these buffers to user space via bpf_perf_event_output().
>

exactly
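
Roughly like this, reusing the per-cpu scratch pointer s from the
sketch above, with an illustrative event_size and a
BPF_MAP_TYPE_PERF_EVENT_ARRAY named events (not the actual code):

    struct {
            __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
            __uint(key_size, sizeof(int));
            __uint(value_size, sizeof(int));
    } events SEC(".maps");

    /* once the per-cpu scratch buffer has been filled: */
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                          s->buf, event_size);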

> Anyway back to preempt_disable(). Think of it as a giant spin_lock
> that covers the whole program. In preemptible kernels it hurts
> tail latency and fairness, and is completely unacceptable in RT.
> That's why we moved to migrate_disable.
> Technically we can add a bpf_preempt_disable() kfunc, but if we do that
> we'll be back to square one. The issues with preemptions and RT
> will reappear. So let's figure out a different solution.
> Why not use a scratch buffer per program?

I totally understand the reason for avoiding disabling preemption,
especially on RT kernels.
I believe the answer to why we don't use a scratch buffer per program
is simply memory consumption.
In our use case, Tracee [1], we let the user choose which events to
trace for a specific workload.
The list of events is very big, and we have many BPF programs attached
to different places in the kernel.
Let's assume that we have 100 events, and for each event a different
BPF program.
Then having a 32kb per-cpu scratch buffer per program translates to
3.2MB per CPU, and ~100MB on a machine with 32 CPUs, which is more
common in our case.
Since we are always adding new events to Tracee, this also would not
scale.
Yet, if there is no other solution, I believe we will go in that
direction.

[1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
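
For what it's worth, if we do go the per-program route, the buffers
could be stamped out with a macro along these lines (a sketch reusing
the illustrative struct scratch from above; names are not real Tracee
identifiers):

    #define DEFINE_SCRATCH(name)                            \
    struct {                                                \
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);        \
            __uint(max_entries, 1);                         \
            __type(key, __u32);                             \
            __type(value, struct scratch);                  \
    } name SEC(".maps")

    DEFINE_SCRATCH(file_open_scratch);  /* used only by the file-open prog */
    DEFINE_SCRATCH(proc_free_scratch);  /* used only by the process-free prog */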



