On Thu, Jan 26, 2023 at 5:29 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> >
> > On Thu, Jan 26, 2023 at 4:22 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > >
> > > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > > > that covers the whole program. In preemptable kernels it hurts
> > > > > tail latency and fairness, and is completely unacceptable in RT.
> > > > > That's why we moved to migrate_disable.
> > > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that
> > > > > we'll be back to square one. The issues with preemptions and RT
> > > > > will reappear. So let's figure out a different solution.
> > > > > Why not use a scratch buffer per program ?
> > > > Totally understand the reason for avoiding preemption disable,
> > > > especially in RT kernels.
> > > > I believe the answer for why not to use a scratch buffer per program
> > > > will simply be memory space.
> > > > In our use-case, Tracee [1], we let the user choose whatever events to
> > > > trace for a specific workload.
> > > > This list of events is very big, and we have many BPF programs
> > > > attached to different places of the kernel.
> > > > Let's assume that we have 100 events, and for each event we have a
> > > > different BPF program.
> > > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
> > > > cpu, and ~100MB per 32 CPUs, which is more common for our case.
> > > Well, 100 bpf progs consume at least a page each,
> > > so you might want one program attached to all events.
> >
> > Unfortunately, I don't think that would be possible. We still need to
> > support kernels with the 4096 instructions limit.
> > We may add some generic programs for events with simpler logic, but
> > even then, support for bpf cookies needed for such programs was only
> > added in more recent kernels.
> > > > Since we always add new events to Tracee, this will also not be scalable.
> > > > Yet, if there is no other solution, I believe we will go in that direction
> > > >
> > > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
> > >
> > > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?
> > We actually have 3 different percpu maps there shared between
> > different programs, so we will have to take care of them all.
> > >
> > > Instead of scratch_map per program, use atomic per-cpu counter
> > > for recursion.
> > > You'll have 3 levels in the worst case.
> > Is it guaranteed that no more than 3 levels exist?
> > I suggested a similar solution with 2 levels at the beginning of this
> > thread, but Yonghong Song replied that there is no restriction on
> > this.
> There are no restrictions, but I doubt you can construct a case
> where you'll see more than 3 in practice.
>
Ok, I see.

> > > So it will be:
> > > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> > > On prog entry increment the recursion counter, on exit decrement.
> > > And use that particular scratch_t in the prog.
> >
> > I'm just afraid of potential deadlocks. For sure we will need to
> > decrement the counter on each return path, but can it happen that a
> > bpf program crashes, leaving the counter non-zero? I know that's the
> > job of the verifier to make sure such things won't happen, but can we
> > be sure of that?
>
> if you don't trust the verifier you shouldn't be using bpf in the first place.

Oh believe me, I trust it, but I must be honest and say that I don't know all
the intricacies of the kernel, especially when it comes to scheduling and
preemption. So I'm just trying to think whether there is a path where this
mechanism can break, and to voice my concern about potential errors or bugs
that could lead to unexpected behavior.
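
Just to make sure I understand the suggestion correctly, is it something along
the lines of the rough sketch below? It is only a sketch and not tested: it
uses plain libbpf map definitions rather than our own macros, the attach point
and the struct/function names are made up, and I'm assuming the level counter
can live in its own single-entry per-cpu array.

/* Rough sketch only - not tested. Assumes vmlinux.h generated from BTF and
 * libbpf's bpf_helpers.h; map, struct and function names are illustrative.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_NEST_LEVELS 3          /* the "3 levels in the worst case" above */

typedef struct {
	u8 data[32 * 1024];        /* 32kb scratch space per level */
} scratch_t;

/* one scratch_t per possible nesting level, per cpu */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, MAX_NEST_LEVELS);
	__type(key, u32);
	__type(value, scratch_t);
} scratch_map SEC(".maps");

/* per-cpu nesting depth, single entry */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} nest_level SEC(".maps");

static __always_inline scratch_t *scratch_enter(void)
{
	u32 zero = 0;
	u32 *level = bpf_map_lookup_elem(&nest_level, &zero);

	if (!level)
		return NULL;

	/* atomically claim the current level and bump the counter for any
	 * program that might nest on top of us on this cpu
	 */
	u32 lvl = __sync_fetch_and_add(level, 1);
	if (lvl >= MAX_NEST_LEVELS) {
		/* deeper nesting than expected - undo and bail out */
		__sync_fetch_and_add(level, -1);
		return NULL;
	}

	return bpf_map_lookup_elem(&scratch_map, &lvl);
}

static __always_inline void scratch_exit(void)
{
	u32 zero = 0;
	u32 *level = bpf_map_lookup_elem(&nest_level, &zero);

	if (level)
		__sync_fetch_and_add(level, -1);
}

SEC("kprobe/do_sys_openat2")       /* attach point only as an example */
int trace_example_event(struct pt_regs *ctx)
{
	scratch_t *scratch = scratch_enter();

	if (!scratch)
		return 0;

	/* ... build and submit the event using this level's scratch buffer ... */

	scratch_exit();            /* must run on every return path */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

If that's the idea, then I guess our main task is auditing every program so
that the decrement really runs on every return path, which is exactly the part
I'm worried about above.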