On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
>
> On Tue, Jan 24, 2023 at 19:24 Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 24, 2023 at 14:30 Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Jan 23, 2023 at 23:25 Jakub Sitnicki
> > > > > <jakub@xxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > On Mon, Jan 23, 2023 at 22:06 Martin KaFai Lau
> > > > > > > <martin.lau@xxxxxxxxx> wrote:
> > > > > > >>
> > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > >> >>> is there a way to do that?
> > > > > > >> Maybe a percpu atomic counter to see if the bpf prog has been re-entered
> > > > > > >> on the same cpu.
> > > > > > >
> > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > program data in a percpu map and I see that the counter is bigger than
> > > > > > > zero, should I ignore the event?
> > > > > >
> > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > Perhaps that's needed now too.
> > > > >
> > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > from the bpf program - maybe even adding a helper for that?
> > > >
> > > > I'm not sure what the issue is here.
> > > > The old preempt_disable() didn't mean that one bpf program would never
> > > > be interrupted by another bpf prog.
> > > > For example, a networking bpf prog under the old preempt_disable could
> > > > call into something where there is a kprobe, and another tracing bpf
> > > > prog would be called.
> > > > The same can happen after we switched to migrate_disable.
> > >
> > > One difference here is that in what you describe the programmer can
> > > know in advance which functions might call others and avoid that, or
> > > use separate percpu maps. But if preemption can happen between functions
> > > that are unrelated to one another (that have no caller/callee
> > > relationship), then the programmer has no control over it.
> >
> > Could you give a specific example?
>
> Sure. I can give two examples from places where we saw such corruption:
>
> 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> and a percpu scratch map to prepare the event for submit. When we also
> added a TRACEPOINT to trace sched_process_free (where we also use this
> percpu scratch map), the security_file_open events got corrupted and
> we didn't know what was happening (it was very hard to debug).

bpf_lsm_file_open is sleepable. We never did preempt_disable there.
A kprobe on security_file_open works, but it's a bit of a hack.

With preempt->migrate, the delayed_put_task_struct->trace_sched_process_free
can happen on the same cpu if the prog gets preempted in a preemptible
kernel, but something is fishy here.
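For reference, Martin's percpu counter idea above could look roughly like
this on the prog side. This is an untested sketch (the map name, hook, and
event logic are made up), and every prog that shares the scratch map would
need to go through the same counter:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* one counter per cpu to detect re-entry on the same cpu */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} busy SEC(".maps");

SEC("kprobe/security_file_open")
int trace_file_open(struct pt_regs *ctx)
{
	__u32 zero = 0;
	__u64 *cnt = bpf_map_lookup_elem(&busy, &zero);

	if (!cnt)
		return 0;
	if (__sync_fetch_and_add(cnt, 1)) {
		/* another prog on this cpu is mid-flight, so the
		 * percpu scratch map may be in use; drop the event
		 */
		__sync_fetch_and_sub(cnt, 1);
		return 0;
	}
	/* ... fill the percpu scratch map and submit the event ... */
	__sync_fetch_and_sub(cnt, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

The tracepoint prog on sched_process_free would take the same
counter before touching the shared scratch map.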
Non-sleepable bpf progs always run in an rcu critical section, while
delayed_put_task_struct is call_rcu(), so if you use a per-cpu scratch
buffer for the duration of a bpf prog (under either the old
preempt_disable or the new migrate_disable), trace_sched_process_free
won't be called until the prog has finished. If you're using a per-cpu
scratch buffer to remember data in one execution of kprobe-bpf and
consume it in the next, then all bets are off. That is going to break
sooner or later.

> 2. same was happening when kprobes were combined with cgroup_skb
> programs to trace network events

irqs can interrupt the old preempt_disable, so depending on where the
kprobe-bpf is, it can run while cgroup-bpf hasn't finished. Then there
are differences between kernel configs: no preempt, full preempt,
preempt_rcu, and differences in the rcu_tasks_trace implementation over
the years.

You can only assume the validity of the pointer into a bpf per-cpu array
or bpf per-cpu hash, but you can make no assumption about concurrent
access to the same map if you share it across different programs.

See the earlier BPF_F_LOCK suggestion. It might be what you need.
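In case it helps, a minimal sketch of the BPF_F_LOCK route (untested; the
struct, map, and prog names are made up). It needs a struct bpf_spin_lock
embedded in the map value, it works with regular maps only (not PERCPU, as
Jakub noted), and tracing prog types cannot use spin-lock maps today, so
the example uses cgroup_skb:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct event_scratch {
	struct bpf_spin_lock lock;
	__u64 payload[8];
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH); /* not PERCPU: BPF_F_LOCK doesn't support it */
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, struct event_scratch);
} shared SEC(".maps");

SEC("cgroup_skb/egress")
int update_under_lock(struct __sk_buff *skb)
{
	struct event_scratch val = {};
	__u32 key = 0;

	val.payload[0] = skb->len;
	/* copies val into the map entry while the entry's spin lock is held */
	bpf_map_update_elem(&shared, &key, &val, BPF_F_LOCK);
	return 1;
}

char LICENSE[] SEC("license") = "GPL";

User space can then read a consistent copy of the entry with
bpf_map_lookup_elem_flags(map_fd, &key, &val, BPF_F_LOCK).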