On Thu, Jan 26, 2023 at 5:29 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023 at 10:59 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> >
> > On Thu, Jan 26, 2023 at 4:22 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 25, 2023 at 11:59 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > >
> > > > > Anyway back to preempt_disable(). Think of it as a giant spin_lock
> > > > > that covers the whole program. In preemptable kernels it hurts
> > > > > tail latency and fairness, and is completely unacceptable in RT.
> > > > > That's why we moved to migrate_disable.
> > > > > Technically we can add bpf_preempt_disable() kfunc, but if we do that
> > > > > we'll be back to square one. The issues with preemptions and RT
> > > > > will reappear. So let's figure out a different solution.
> > > > > Why not use a scratch buffer per program ?
> > > > Totally understand the reason for avoiding preemption disable,
> > > > especially in RT kernels.
> > > > I believe the answer for why not to use a scratch buffer per program
> > > > will simply be memory space.
> > > > In our use-case, Tracee [1], we let the user choose whatever events to
> > > > trace for a specific workload.
> > > > This list of events is very big, and we have many BPF programs
> > > > attached to different places of the kernel.
> > > > Let's assume that we have 100 events, and for each event we have a
> > > > different BPF program.
> > > > Then having 32kb per-cpu scratch buffers translates to 3.2MB per one
> > > > cpu, and ~100MB per 32 CPUs, which is more common for our case.
> > > Well, 100 bpf progs consume at least a page each,
> > > so you might want one program attached to all events.
> >
> > Unfortunately, I don't think that would be possible. We still need to
> > support kernels with the 4096 instructions limit.
> > We may add some generic programs for events with simpler logic, but
> > even then, support for bpf cookies needed for such programs was only
> > added in more recent kernels.
> > > > Since we always add new events to Tracee, this will also not be scalable.
> > > > Yet, if there is no other solution, I believe we will go in that direction
> > > >
> > > > [1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
> > >
> > > you're talking about BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1); ?
> > We actually have 3 different percpu maps there shared between
> > different programs, so we will have to take care of them all.
> > >
> > > Instead of scratch_map per program, use atomic per-cpu counter
> > > for recursion.
> > > You'll have 3 levels in the worst case.
> > Is it guaranteed that no more than 3 levels exist?
> > I suggested a similar solution with 2 levels at the beginning of this
> > thread, but Yonghong Song replied that there is no restriction on
> > this.
> There are no restrictions, but I doubt you can construct a case
> where you'll see more than 3 in practice.
>
Ok, I see.

> > > So it will be:
> > > BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);
> > > On prog entry increment the recursion counter, on exit decrement.
> > > And use that particular scratch_t in the prog.
> >
> > I'm just afraid of potential deadlocks. For sure we will need to
> > decrement the counter on each return path, but can it happen that a
> > bpf program crashes, leaving the counter non-zero? I know that's the
> > job of the verifier to make sure such things won't happen, but can we
> > be sure of that?
>
> if you don't trust the verifier you shouldn't be using bpf in the first place.

Oh believe me, I trust it, but I must be honest and say that I don't know all
the intricacies of the kernel, especially when it comes to scheduling and
preemption. So I'm just trying to think whether there is a path where this
mechanism can break, and to voice my concern about potential errors or bugs
that could lead to unexpected behavior.
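
Just to make sure I understand the suggestion correctly, is it something along
the lines of the rough sketch below? It is only a sketch and not tested: it
uses plain libbpf map definitions rather than our own macros, the attach point
and the struct/function names are made up, and I'm assuming the level counter
can live in its own single-entry per-cpu array.

/* Rough sketch only - not tested. Assumes vmlinux.h generated from BTF and
 * libbpf's bpf_helpers.h; map, struct and function names are illustrative.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_NEST_LEVELS 3          /* the "3 levels in the worst case" above */

typedef struct {
	u8 data[32 * 1024];        /* 32kb scratch space per level */
} scratch_t;

/* one scratch_t per possible nesting level, per cpu */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, MAX_NEST_LEVELS);
	__type(key, u32);
	__type(value, scratch_t);
} scratch_map SEC(".maps");

/* per-cpu nesting depth, single entry */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} nest_level SEC(".maps");

static __always_inline scratch_t *scratch_enter(void)
{
	u32 zero = 0;
	u32 *level = bpf_map_lookup_elem(&nest_level, &zero);

	if (!level)
		return NULL;

	/* atomically claim the current level and bump the counter for any
	 * program that might nest on top of us on this cpu
	 */
	u32 lvl = __sync_fetch_and_add(level, 1);
	if (lvl >= MAX_NEST_LEVELS) {
		/* deeper nesting than expected - undo and bail out */
		__sync_fetch_and_add(level, -1);
		return NULL;
	}

	return bpf_map_lookup_elem(&scratch_map, &lvl);
}

static __always_inline void scratch_exit(void)
{
	u32 zero = 0;
	u32 *level = bpf_map_lookup_elem(&nest_level, &zero);

	if (level)
		__sync_fetch_and_add(level, -1);
}

SEC("kprobe/do_sys_openat2")       /* attach point only as an example */
int trace_example_event(struct pt_regs *ctx)
{
	scratch_t *scratch = scratch_enter();

	if (!scratch)
		return 0;

	/* ... build and submit the event using this level's scratch buffer ... */

	scratch_exit();            /* must run on every return path */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

If that's the idea, then I guess our main task is auditing every program so
that the decrement really runs on every return path, which is exactly the part
I'm worried about above.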