On Wed, Jan 25, 2023 at 8:39 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
>
> On Wed, Jan 25, 2023 at 2:04 Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Tue, Jan 24, 2023 at 9:38 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 24, 2023 at 19:24 Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 24, 2023 at 7:47 AM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Jan 24, 2023 at 14:30 Alexei Starovoitov
> > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Mon, Jan 23, 2023 at 2:03 PM Yaniv Agman <yanivagman@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 23, 2023 at 23:25 Jakub Sitnicki
> > > > > > > <jakub@xxxxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 23, 2023 at 11:01 PM +02, Yaniv Agman wrote:
> > > > > > > > > On Mon, Jan 23, 2023 at 22:06 Martin KaFai Lau
> > > > > > > > > <martin.lau@xxxxxxxxx> wrote:
> > > > > > > > >>
> > > > > > > > >> On 1/23/23 9:32 AM, Yaniv Agman wrote:
> > > > > > > > >> >>> interrupted the first one. But even then, I will need to find a way to
> > > > > > > > >> >>> know if my program currently interrupts the run of another program -
> > > > > > > > >> >>> is there a way to do that?
> > > > > > > > >> Maybe a percpu atomic counter to see if the bpf prog has been re-entered on the
> > > > > > > > >> same cpu.
> > > > > > > > >
> > > > > > > > > Not sure I understand how this will help. If I want to save local
> > > > > > > > > program data on a percpu map and I see that the counter is bigger than
> > > > > > > > > zero, should I ignore the event?
> > > > > > > >
> > > > > > > > map_update w/ BPF_F_LOCK disables preemption, if you're after updating
> > > > > > > > an entry atomically. But it can't be used with PERCPU maps today.
> > > > > > > > Perhaps that's needed now too.
> > > > > > >
> > > > > > > Yep. I think what is needed here is the ability to disable preemption
> > > > > > > from the bpf program - maybe even adding a helper for that?
> > > > > >
> > > > > > I'm not sure what the issue is here.
> > > > > > Old preempt_disable() doesn't mean that one bpf program won't ever
> > > > > > be interrupted by another bpf prog.
> > > > > > Like networking bpf prog in old preempt_disable can call into something
> > > > > > where there is a kprobe and another tracing bpf prog will be called.
> > > > > > The same can happen after we switched to migrate_disable.
> > > > >
> > > > > One difference here is that in what you describe the programmer can
> > > > > know in advance which functions might call others and avoid that or
> > > > > use other percpu maps, but if preemption can happen between functions
> > > > > which are not related to one another (don't have a relation of caller
> > > > > and callee), then the programmer has no control over it.
> > > >
> > > > Could you give a specific example?
> > >
> > > Sure. I can give two examples from places where we saw such corruption:
> > >
> > > 1. We use kprobes to trace some LSM hooks, e.g. security_file_open,
> > > and a percpu scratch map to prepare the event for submit. When we also
> > > added a TRACEPOINT to trace sched_process_free (where we also use this
> > > scratch percpu map), the security_file_open events got corrupted and
> > > we didn't know what was happening (it was very hard to debug).
> >
> > bpf_lsm_file_open is sleepable.
> > We never did preempt_disable there.
> > kprobe on security_file_open works, but it's a bit of a hack.
> > With preempt->migrate
> > the delayed_put_task_struct->trace_sched_process_free can happen
> > on the same cpu if the prog gets preempted in a preemptible kernel,
> > but something is fishy here.
> > Non-sleepable bpf progs always run in rcu cs while
> > delayed_put_task_struct is call_rcu(),
> > so if you've used a per-cpu scratch buffer for the duration
> > of the bpf prog (either old preempt_disable or new migrate_disable)
> > the trace_sched_process_free won't be called until the prog is finished.
> >
> > If you're using a per-cpu scratch buffer to remember the data in one
> > execution of kprobe-bpf and consuming it in the next then all bets are off.
> > This is going to break sooner or later.
>
> Yes, I agree this is fishy, but we only use the per-cpu scratch map
> for the duration of a bpf program, assuming tail-calls are considered
> to be part of the same program.
> We've even spotted the place where preemption happens to be right
> after calling the bpf_perf_event_submit() helper by placing bpf_printk in
> the security_file_open kprobe. Before the call to the helper the
> buffer was ok, but after it got corrupted (and there were more lines
> in the program after the call to this helper).

I see. bpf_perf_event_output() may trigger irq_work->arch_irq_work_raise,
which will send an IPI to the current cpu, which may resched.

I'm still missing why the kprobe in security_file_open has to share a
per-cpu scratch buffer with the kprobe in trace_sched_process_free.
Why not use a scratch buffer per program?

> > > 2. same was happening when kprobes were combined with cgroup_skb
> > > programs to trace network events
> >
> > irqs can interrupt old preempt_disable, so depending on where kprobe-bpf
> > is it can run while cgroup-bpf hasn't finished.
>
> Actually what we saw here is that the ingress skb program that runs
> from ksoftirq/idle context corrupts a percpu map shared with a
> raw_sys_exit tracepoint used to submit execve/openat events.
>
> > Then there are differences in kernel configs: no preempt, full preempt,
> > preempt_rcu and differences in rcu_tasks_trace implementation
> > over the years. You can only assume the validity of the pointer
> > to a bpf per-cpu array and bpf per-cpu hash, but make no assumption
> > about concurrent access to the same map if you share the same map
> > across different programs.
>
> I think that is the main point here. Thanks for clarifying this.
>
> > See the BPF_F_LOCK suggestion earlier. It might be what you need.
>
> I am not sure how exactly this can help. The size of the percpu buffer
> we use is 32kb, and we don't use the map_update helper but the pointer
> returned from the lookup to update the map in several places in the
> program.

It wasn't clear what you do with the scratch buffer.
BPF_F_LOCK will help if user space wants to bpf_lookup_elem it without
a race with a bpf prog that uses bpf_spin_lock() to access the element.
Looks like that's not the case you had in mind. You just stream these
buffers to user space via bpf_perf_event_output().

Anyway, back to preempt_disable(). Think of it as a giant spin_lock that
covers the whole program. In preemptible kernels it hurts tail latency
and fairness, and it is completely unacceptable in RT.
That's why we moved to migrate_disable.
Technically we can add a bpf_preempt_disable() kfunc, but if we do that
we'll be back to square one. The issues with preemption and RT will
reappear. So let's figure out a different solution.
Why not use a scratch buffer per program?
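To make that concrete, something along these lines is what I mean by a
scratch buffer per program - an untested sketch with a single-entry
BPF_MAP_TYPE_PERCPU_ARRAY dedicated to the one program that uses it
(the map, struct and section names below are made up for illustration,
the 32k size is just taken from your description):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* scratch buffer dedicated to this one program, never shared */
struct scratch {
        __u8 data[32 * 1024];
};

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct scratch);
} file_open_scratch SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} events SEC(".maps");

SEC("kprobe/security_file_open")
int BPF_KPROBE(trace_file_open, struct file *file)
{
        __u32 zero = 0;
        struct scratch *buf;

        buf = bpf_map_lookup_elem(&file_open_scratch, &zero);
        if (!buf)
                return 0;

        /* prepare the event in buf->data; no other program touches this
         * map, so a program that preempts this one cannot corrupt it
         */

        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                              buf->data, sizeof(buf->data));
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

The sched_process_free tracepoint prog would get its own map the same
way, so a preemption between the two programs can no longer corrupt
shared state.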
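And if some per-cpu state really has to be shared across programs, the
percpu re-entry counter Martin suggested earlier could look roughly
like this (also untested and purely illustrative, same includes as the
sketch above):

/* bump the counter on entry, drop it on exit, and treat a value > 1 as
 * "this program re-entered on top of another one on the same cpu",
 * in which case the shared scratch map must not be touched
 */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} nest_cnt SEC(".maps");

static __always_inline __u64 *nest_enter(void)
{
        __u32 zero = 0;
        __u64 *cnt = bpf_map_lookup_elem(&nest_cnt, &zero);

        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        return cnt;
}

static __always_inline void nest_exit(__u64 *cnt)
{
        if (cnt)
                __sync_fetch_and_add(cnt, -1);
}

/* in each program sharing the scratch map:
 *
 *      __u64 *cnt = nest_enter();
 *      if (!cnt || *cnt > 1)
 *              goto out;       // nested: skip the shared scratch map
 *      ...
 * out:
 *      nest_exit(cnt);
 */

But per-program scratch buffers look like the simpler fix.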