On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023 at 2:04 Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 24, 2023 at 19:24 Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Jan 24, 2023 at 14:30 Alexei Starovoitov
> > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 23, 2023 at 23:25 Jakub Sitnicki
> > > > > > > <jakub@xxxxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > > > On Mon, Jan 23, 2023 at 22:06 Martin KaFai Lau
> > > > > > > > > <martin.lau@xxxxxxxxx> wrote:
> > > > > > > > >>
> > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > > > >> >>> is there a way to do that?
> > > > > > > > >> Maybe a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > > > >> same cpu.
> > > > > > > > >
> > > > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > > > program data on a percpu map and I see that the counter is bigger than
> > > > > > > > > zero, should I ignore the event?
> > > > > > > >
> > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > > > Perhaps that's needed now too.
> > > > > > >
> > > > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > > > from the bpf program - maybe even adding a helper for that?
> > > > > >
> > > > > > I'm not sure what the issue is here.
> > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > > > be interrupted by another bpf prog.
> > > > > > Like networking bpf prog in old preempt_disable can call into something
> > > > > > where there is a kprobe and another tracing bpf prog will be called.
> > > > > > The same can happen after we switched to migrate_disable.
> > > > >
> > > > > One difference here is that in what you describe the programmer can
> > > > > know in advance which functions might call others and avoid that or
> > > > > use other percpu maps, but if preemption can happen between functions
> > > > > which are not related to one another (don't have a relation of caller
> > > > > and callee), then the programmer has no control over it.
> > > >
> > > > Could you give a specific example?
> > >
> > > Sure. I can give two examples from places where we saw such corruption:
> > >
> > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> > > and a percpu scratch map to prepare the event for submit. When we also
> > > added a TRACEPOINT to trace sched_process_free (where we also use this
> > > scratch percpu map), the security_file_open events got corrupted and
> > > we didn't know what was happening (it was very hard to debug).
> >
> > bpf_lsm_file_open is sleepable.
> > We never did preempt_disable there.
> > kprobe on security_file_open works, but it's a bit of a hack.
> > With preempt->migrate
> > the delayed_put_task_struct->trace_sched_process_free can happen
> > on the same cpu if the prog gets preempted in a preemptible kernel,
> > but something is fishy here.
> > Non-sleepable bpf progs always run in rcu cs while
> > delayed_put_task_struct is call_rcu(),
> > so if you've used a per-cpu scratch buffer for the duration
> > of the bpf prog (either old preempt_disable or new migrate_disable)
> > the trace_sched_process_free won't be called until the prog is finished.
> >
> > If you're using a per-cpu scratch buffer to remember the data in one
> > execution of kprobe-bpf and consuming it in the next then all bets are off.
> > This is going to break sooner or later.
>
> Yes, I agree this is fishy, but we only use the per-cpu scratch map
> for the duration of a bpf program, assuming tail-calls are considered
> to be part of the same program.
> We've even spotted the place where preemption happens to be right
> after calling the bpf_perf_event_submit() helper by placing bpf_printk in
> the security_file_open kprobe. Before the call to the helper the
> buffer was ok, but after it got corrupted (and there were more lines
> in the program after the call to this helper).

I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise,
which will send an IPI to the current cpu, which may resched.

I'm still missing why the kprobe in security_file_open has to share a
per-cpu scratch buffer with the kprobe in trace_sched_process_free.
Why not use a scratch buffer per program?

> > > 2. same was happening when kprobes were combined with cgroup_skb
> > > programs to trace network events
> >
> > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf
> > is it can run while cgroup-bpf hasn't finished.
>
> Actually what we saw here is that the ingress skb program that runs
> from ksoftirq/idle context corrupts a percpu map shared with a
> raw_sys_exit tracepoint used to submit execve/openat events.
>
> > Then there are differences in kernel configs: no preempt, full preempt,
> > preempt_rcu and differences in rcu_tasks_trace implementation
> > over the years. You can only assume the validity of the pointer
> > to a bpf per-cpu array and bpf per-cpu hash, but make no assumption
> > about concurrent access to the same map if you share the same map
> > across different programs.
>
> I think that is the main point here. Thanks for clarifying this.
>
> > See the BPF_F_LOCK suggestion earlier. It might be what you need.
>
> I am not sure how exactly this can help. The size of the percpu buffer
> we use is 32kb, and we don't use the map_update helper but the pointer
> returned from the lookup to update the map in several places in the
> program.

It wasn't clear what you do with the scratch buffer.
BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without
a race with a bpf prog that uses bpf_spin_lock() to access the element.
Looks like that's not the case you had in mind. You just stream these
buffers to user space via bpf_perf_event_output().

Anyway, back to preempt_disable(). Think of it as a giant spin_lock that
covers the whole program. In preemptible kernels it hurts tail latency
and fairness, and it is completely unacceptable in RT.
That's why we moved to migrate_disable.
Technically we can add a bpf_preempt_disable() kfunc, but if we do that
we'll be back to square one. The issues with preemption and RT will
reappear. So let's figure out a different solution.
Why not use a scratch buffer per program?
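To make that concrete, something along these lines is what I mean by a
scratch buffer per program - an untested sketch with a single-entry
BPF_MAP_TYPE_PERCPU_ARRAY dedicated to the one program that uses it
(the map, struct and section names below are made up for illustration,
the 32k size is just taken from your description):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* scratch buffer dedicated to this one program, never shared */
struct scratch {
        __u8 data[32 * 1024];
};

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct scratch);
} file_open_scratch SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} events SEC(".maps");

SEC("kprobe/security_file_open")
int BPF_KPROBE(trace_file_open, struct file *file)
{
        __u32 zero = 0;
        struct scratch *buf;

        buf = bpf_map_lookup_elem(&file_open_scratch, &zero);
        if (!buf)
                return 0;

        /* prepare the event in buf->data; no other program touches this
         * map, so a program that preempts this one cannot corrupt it
         */

        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                              buf->data, sizeof(buf->data));
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

The sched_process_free tracepoint prog would get its own map the same
way, so a preemption between the two programs can no longer corrupt
shared state.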
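And if some per-cpu state really has to be shared across programs, the
percpu re-entry counter Martin suggested earlier could look roughly
like this (also untested and purely illustrative, same includes as the
sketch above):

/* bump the counter on entry, drop it on exit, and treat a value > 1 as
 * "this program re-entered on top of another one on the same cpu",
 * in which case the shared scratch map must not be touched
 */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} nest_cnt SEC(".maps");

static __always_inline __u64 *nest_enter(void)
{
        __u32 zero = 0;
        __u64 *cnt = bpf_map_lookup_elem(&nest_cnt, &zero);

        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        return cnt;
}

static __always_inline void nest_exit(__u64 *cnt)
{
        if (cnt)
                __sync_fetch_and_add(cnt, -1);
}

/* in each program sharing the scratch map:
 *
 *      __u64 *cnt = nest_enter();
 *      if (!cnt || *cnt > 1)
 *              goto out;       // nested: skip the shared scratch map
 *      ...
 * out:
 *      nest_exit(cnt);
 */

But per-program scratch buffers look like the simpler fix.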