On Thu, Jan 26, 2023 at 4:22 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> >
> > > Anyway, back to preempt_disable(). Think of it as a giant spin_lock
> > > that covers the whole program. In preemptible kernels it hurts
> > > tail latency and fairness, and is completely unacceptable in RT.
> > > That's why we moved to migrate_disable.
> > > Technically we could add a bpf_preempt_disable() kfunc, but if we do
> > > that we'll be back to square one. The issues with preemption and RT
> > > will reappear. So let's figure out a different solution.
> > > Why not use a scratch buffer per program?
> >
> > Totally understand the reason for avoiding disabling preemption,
> > especially in RT kernels.
> > I believe the answer to why we can't use a scratch buffer per program
> > is simply memory footprint.
> > In our use case, Tracee [1], we let the user choose which events to
> > trace for a specific workload.
> > The list of events is very big, and we have many BPF programs
> > attached to different places in the kernel.
> > Let's assume we have 100 events, with a different BPF program for
> > each event.
> > Then a 32KB per-CPU scratch buffer per program translates to 3.2MB
> > per CPU, and ~100MB for 32 CPUs, which is more common in our case.
>
> Well, 100 bpf progs consume at least a page each,
> so you might want one program attached to all events.

Unfortunately, I don't think that would be possible. We still need to
support kernels with the 4096-instruction limit. We may add some generic
programs for events with simpler logic, but even then, support for the
BPF cookies such programs would need was only added in more recent
kernels.

> > Since we always add new events to Tracee, this will also not be
> > scalable.
> > Yet, if there is no other solution, I believe we will go in that
> > direction.
> >
> > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
>
> you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?

We actually have 3 different per-CPU maps there, shared between
different programs, so we will have to take care of them all.

> Instead of a scratch_map per program, use an atomic per-cpu counter
> for recursion.
> You'll have 3 levels in the worst case.

Is it guaranteed that no more than 3 levels exist? I suggested a
similar solution with 2 levels at the beginning of this thread, but
Yonghong Song replied that there is no restriction on this.

> So it will be:
> BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> On prog entry increment the recursion counter, on exit decrement.
> And use that particular scratch_t in the prog.

I'm just afraid of potential deadlocks. For sure we will need to
decrement the counter on each return path, but can it happen that a BPF
program crashes, leaving the counter non-zero? I know it's the
verifier's job to make sure such things don't happen, but can we be
sure of that?