On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
>
> On Thu, Jan 26, 2023 at 4:22 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > >
> > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > > that covers the whole program. In preemptable kernels it hurts
> > > > tail latency and fairness, and is completely unacceptable in RT.
> > > > That's why we moved to migrate_disable.
> > > > Technically we can add a bpf_preempt_disable() kfunc, but if we do that
> > > > we'll be back to square one. The issues with preemption and RT
> > > > will reappear. So let's figure out a different solution.
> > > > Why not use a scratch buffer per program?
> > >
> > > Totally understand the reason for avoiding preemption disable,
> > > especially in RT kernels.
> > > I believe the answer for why not to use a scratch buffer per program
> > > will simply be memory space.
> > > In our use case, Tracee [1], we let the user choose which events to
> > > trace for a specific workload.
> > > This list of events is very big, and we have many BPF programs
> > > attached to different places in the kernel.
> > > Let's assume that we have 100 events, and for each event we have a
> > > different BPF program.
> > > Then having 32KB per-CPU scratch buffers translates to 3.2MB per
> > > CPU, and ~100MB for 32 CPUs, which is more typical in our case.
> >
> > Well, 100 bpf progs consume at least a page each,
> > so you might want one program attached to all events.
> >
>
> Unfortunately, I don't think that would be possible. We still need to
> support kernels with the 4096-instruction limit.
> We may add some generic programs for events with simpler logic, but
> even then, support for the bpf cookies needed for such programs was only
> added in more recent kernels.
>
> > > Since we always add new events to Tracee, this will also not be scalable.
> > > Yet, if there is no other solution, I believe we will go in that direction.
> > >
> > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
> >
> > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?
>
> We actually have 3 different percpu maps there shared between
> different programs, so we will have to take care of them all.
>
> >
> > Instead of a scratch_map per program, use an atomic per-cpu counter
> > for recursion.
> > You'll have 3 levels in the worst case.
>
> Is it guaranteed that no more than 3 levels exist?
> I suggested a similar solution with 2 levels at the beginning of this
> thread, but Yonghong Song replied that there is no restriction on
> this.

There are no restrictions, but I doubt you can construct a case
where you'll see more than 3 in practice.

> > So it will be:
> > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> > On prog entry increment the recursion counter, on exit decrement.
> > And use that particular scratch_t in the prog.
>
> I'm just afraid of potential deadlocks. For sure we will need to
> decrement the counter on each return path, but can it happen that a
> bpf program crashes, leaving the counter non-zero? I know that's the
> job of the verifier to make sure such things won't happen, but can we
> be sure of that?

If you don't trust the verifier, you shouldn't be using bpf
in the first place.
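
Roughly, the recursion-counter idea could look like the untested sketch below.
The scratch_t layout, map names, attach point, and the use of the
value-returning __sync_fetch_and_add (needs BPF atomics, i.e. clang -mcpu=v3
and kernel 5.12+) are just assumptions for illustration, not Tracee's actual
code:

#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define MAX_NEST 3                      /* worst-case nesting discussed above */
#define SCRATCH_SIZE (32 * 1024)

typedef struct {
	__u8 data[SCRATCH_SIZE];
} scratch_t;

/* one scratch_t per nesting level, per CPU */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, MAX_NEST);
	__type(key, __u32);
	__type(value, scratch_t);
} scratch_map SEC(".maps");

/* per-CPU recursion counter */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} recursion_map SEC(".maps");

static __always_inline scratch_t *scratch_enter(void)
{
	__u32 zero = 0;
	__u32 *level = bpf_map_lookup_elem(&recursion_map, &zero);
	if (!level)
		return NULL;

	/* atomically claim the current nesting level as our slot index */
	__u32 idx = __sync_fetch_and_add(level, 1);
	if (idx >= MAX_NEST) {
		__sync_fetch_and_sub(level, 1);
		return NULL;
	}
	return bpf_map_lookup_elem(&scratch_map, &idx);
}

static __always_inline void scratch_exit(void)
{
	__u32 zero = 0;
	__u32 *level = bpf_map_lookup_elem(&recursion_map, &zero);
	if (level && *level > 0)
		__sync_fetch_and_sub(level, 1);  /* drop back one level */
}

SEC("raw_tracepoint/sys_enter")         /* placeholder attach point */
int example_prog(struct bpf_raw_tracepoint_args *ctx)
{
	scratch_t *scratch = scratch_enter();
	if (!scratch)
		return 0;

	/* ... use scratch->data instead of a shared per-program buffer ... */

	/* the verifier guarantees the prog runs to completion, so the
	 * decrement executes on every return path and the counter can't
	 * be left non-zero
	 */
	scratch_exit();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Since a nested program (e.g. one triggered from NMI/perf context) on the same
CPU gets a different idx, each nesting level ends up with its own scratch_t.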