On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
>
> On Thu, Jan 26, 2023 at 4:22 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > >
> > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > > that covers the whole program. In preemptable kernels it hurts
> > > > tail latency and fairness, and is completely unacceptable in RT.
> > > > That's why we moved to migrate_disable.
> > > > Technically we can add a bpf_preempt_disable() kfunc, but if we do that
> > > > we'll be back to square one. The issues with preemption and RT
> > > > will reappear. So let's figure out a different solution.
> > > > Why not use a scratch buffer per program?
> > >
> > > Totally understand the reason for avoiding preemption disable,
> > > especially in RT kernels.
> > > I believe the answer for why not to use a scratch buffer per program
> > > will simply be memory space.
> > > In our use case, Tracee [1], we let the user choose which events to
> > > trace for a specific workload.
> > > This list of events is very big, and we have many BPF programs
> > > attached to different places in the kernel.
> > > Let's assume that we have 100 events, and for each event we have a
> > > different BPF program.
> > > Then having 32KB per-CPU scratch buffers translates to 3.2MB per
> > > CPU, and ~100MB for 32 CPUs, which is more typical in our case.
> >
> > Well, 100 bpf progs consume at least a page each,
> > so you might want one program attached to all events.
> >
>
> Unfortunately, I don't think that would be possible. We still need to
> support kernels with the 4096-instruction limit.
> We may add some generic programs for events with simpler logic, but
> even then, support for the bpf cookies needed for such programs was only
> added in more recent kernels.
>
> > > Since we always add new events to Tracee, this will also not be scalable.
> > > Yet, if there is no other solution, I believe we will go in that direction.
> > >
> > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
> >
> > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?
>
> We actually have 3 different percpu maps there shared between
> different programs, so we will have to take care of them all.
>
> >
> > Instead of a scratch_map per program, use an atomic per-cpu counter
> > for recursion.
> > You'll have 3 levels in the worst case.
>
> Is it guaranteed that no more than 3 levels exist?
> I suggested a similar solution with 2 levels at the beginning of this
> thread, but Yonghong Song replied that there is no restriction on
> this.

There are no restrictions, but I doubt you can construct a case
where you'll see more than 3 in practice.

> > So it will be:
> > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> > On prog entry increment the recursion counter, on exit decrement.
> > And use that particular scratch_t in the prog.
>
> I'm just afraid of potential deadlocks. For sure we will need to
> decrement the counter on each return path, but can it happen that a
> bpf program crashes, leaving the counter non-zero? I know that's the
> job of the verifier to make sure such things won't happen, but can we
> be sure of that?

If you don't trust the verifier, you shouldn't be using bpf
in the first place.
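
Roughly, the recursion-counter idea could look like the untested sketch below.
The scratch_t layout, map names, attach point, and the use of the
value-returning __sync_fetch_and_add (needs BPF atomics, i.e. clang -mcpu=v3
and kernel 5.12+) are just assumptions for illustration, not Tracee's actual
code:

#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define MAX_NEST 3                      /* worst-case nesting discussed above */
#define SCRATCH_SIZE (32 * 1024)

typedef struct {
	__u8 data[SCRATCH_SIZE];
} scratch_t;

/* one scratch_t per nesting level, per CPU */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, MAX_NEST);
	__type(key, __u32);
	__type(value, scratch_t);
} scratch_map SEC(".maps");

/* per-CPU recursion counter */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} recursion_map SEC(".maps");

static __always_inline scratch_t *scratch_enter(void)
{
	__u32 zero = 0;
	__u32 *level = bpf_map_lookup_elem(&recursion_map, &zero);
	if (!level)
		return NULL;

	/* atomically claim the current nesting level as our slot index */
	__u32 idx = __sync_fetch_and_add(level, 1);
	if (idx >= MAX_NEST) {
		__sync_fetch_and_sub(level, 1);
		return NULL;
	}
	return bpf_map_lookup_elem(&scratch_map, &idx);
}

static __always_inline void scratch_exit(void)
{
	__u32 zero = 0;
	__u32 *level = bpf_map_lookup_elem(&recursion_map, &zero);
	if (level && *level > 0)
		__sync_fetch_and_sub(level, 1);  /* drop back one level */
}

SEC("raw_tracepoint/sys_enter")         /* placeholder attach point */
int example_prog(struct bpf_raw_tracepoint_args *ctx)
{
	scratch_t *scratch = scratch_enter();
	if (!scratch)
		return 0;

	/* ... use scratch->data instead of a shared per-program buffer ... */

	/* the verifier guarantees the prog runs to completion, so the
	 * decrement executes on every return path and the counter can't
	 * be left non-zero
	 */
	scratch_exit();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Since a nested program (e.g. one triggered from NMI/perf context) on the same
CPU gets a different idx, each nesting level ends up with its own scratch_t.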