> On Aug 4, 2020, at 6:52 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Aug 4, 2020 at 2:01 PM Song Liu <songliubraving@xxxxxx> wrote:
>>
>>> On Aug 2, 2020, at 10:10 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>>>
>>> On Sun, Aug 2, 2020 at 9:47 PM Song Liu <songliubraving@xxxxxx> wrote:
>>>>
>>>>> On Aug 2, 2020, at 6:51 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>>>>>
>>>>> On Sat, Aug 1, 2020 at 1:50 AM Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>
>>>>>> Add a benchmark to compare performance of
>>>>>> 1) uprobe;
>>>>>> 2) user program w/o args;
>>>>>> 3) user program w/ args;
>>>>>> 4) user program w/ args on random cpu.
>>>>>>
>>>>>
>>>>> Can you please add it to the existing benchmark runner instead, e.g.,
>>>>> along the other bench_trigger benchmarks? No need to re-implement
>>>>> benchmark setup. And also that would also allow to compare existing
>>>>> ways of cheaply triggering a program vs this new _USER program?
>>>>
>>>> Will try.
>>>>
>>>>>
>>>>> If the performance is not significantly better than other ways, do you
>>>>> think it still makes sense to add a new BPF program type? I think
>>>>> triggering KPROBE/TRACEPOINT from bpf_prog_test_run() would be very
>>>>> nice, maybe it's possible to add that instead of a new program type?
>>>>> Either way, let's see comparison with other program triggering
>>>>> mechanisms first.
>>>>
>>>> Triggering KPROBE and TRACEPOINT from bpf_prog_test_run() will be useful.
>>>> But I don't think they can be used instead of user program, for a couple
>>>> reasons. First, KPROBE/TRACEPOINT may be triggered by other programs
>>>> running in the system, so user will have to filter those noise out in
>>>> each program. Second, it is not easy to specify CPU for KPROBE/TRACEPOINT,
>>>> while this feature could be useful in many cases, e.g. get stack trace
>>>> on a given CPU.
>>>>
>>>
>>> Right, it's not as convenient with KPROBE/TRACEPOINT as with the USER
>>> program you've added specifically with that feature in mind. But if
>>> you pin user-space thread on the needed CPU and trigger kprobe/tp,
>>> then you'll get what you want. As for the "noise", see how
>>> bench_trigger() deals with that: it records thread ID and filters
>>> everything not matching. You can do the same with CPU ID. It's not as
>>> automatic as with a special BPF program type, but still pretty simple,
>>> which is why I'm still deciding (for myself) whether USER program type
>>> is necessary :)
>>
>> Here are some bench_trigger numbers:
>>
>> base         :    1.698 ± 0.001M/s
>> tp           :    1.477 ± 0.001M/s
>> rawtp        :    1.567 ± 0.001M/s
>> kprobe       :    1.431 ± 0.000M/s
>> fentry       :    1.691 ± 0.000M/s
>> fmodret      :    1.654 ± 0.000M/s
>> user         :    1.253 ± 0.000M/s
>> fentry-on-cpu:    0.022 ± 0.011M/s
>> user-on-cpu  :    0.315 ± 0.001M/s
>>
>
> Ok, so basically all of raw_tp,tp,kprobe,fentry/fexit are
> significantly faster than USER programs. Sure, when compared to
> uprobe, they are faster, but not when doing on-specific-CPU run, it
> seems (judging from this patch's description, if I'm reading it
> right). Anyways, speed argument shouldn't be a reason for doing this,
> IMO.
>
>> The two "on-cpu" tests run the program on a different CPU (see the patch
>> at the end).
>>
>> "user" is about 25% slower than "fentry". I think this is mostly because
>> getpgid() is a faster syscall than bpf(BPF_TEST_RUN).
>
> Yes, probably.
>
>>
>> "user-on-cpu" is more than 10x faster than "fentry-on-cpu", because IPI
>> is way faster than moving the process (via sched_setaffinity).
>
> I don't think that's a good comparison, because you are actually
> testing sched_setaffinity performance on each iteration vs IPI in the
> kernel, not a BPF overhead.
>
> I think the fair comparison for this would be to create a thread and
> pin it on necessary CPU, and only then BPF program calls in a loop.
> But I bet any of existing program types would beat USER program.
>
>>
>> For use cases that we would like to call BPF program on specific CPU,
>> triggering it via IPI is a lot faster.
>
> So these use cases would be nice to expand on in the motivational part
> of the patch set. It's not really emphasized and it's not at all clear
> what you are trying to achieve. It also seems, depending on latency
> requirements, it's totally possible to achieve comparable results by
> pre-creating a thread for each CPU, pinning each one to its designated
> CPU and then using any suitable user-space signaling mechanism (a
> queue, condvar, etc) to ask a thread to trigger BPF program (fentry on
> getpgid(), for instance).

I don't see why a user-space signal plus fentry would be faster than IPI.
If the target CPU is running something, this is going to add two context
switches.

> I bet in this case the performance would be
> really nice for a lot of practical use cases. But then again, I don't
> know details of the intended use case, so please provide some more
> details.

Being able to trigger a BPF program on a different CPU could enable many
use cases and optimizations. The use case I am looking at is to access
perf_event and percpu maps on the target CPU. For example:

    0. trigger the program;
    1. read perf_event on cpu x;
    2. (optional) check which process is running on cpu x;
    3. add perf_event value to percpu map(s) on cpu x.

If we do these steps in a BPF program on cpu x, the cost is:

    A.0) trigger BPF via IPI;
    A.1) read perf_event locally;
    A.2) local access of current;
    A.3) local access of percpu map(s).

If we can only do these on a different CPU, the cost will be:

    B.0) trigger BPF locally;
    B.1) read perf_event via IPI;
    B.2) remote access of current on cpu x;
    B.3) remote access of percpu map(s), or use non-percpu map(s).

Cost of (A.0 + A.1) is about the same as (B.0 + B.1), maybe a little
higher (sys_bpf() vs. sys_getpgid()). But A.2 and A.3 will be
significantly cheaper than B.2 and B.3.
Does this make sense?

OTOH, I do agree we can trigger bpftrace BEGIN/END with sys_getpgid()
or something similar.

Thanks,
Song
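
For reference, the pinned-thread alternative discussed above (pin a user-space thread to the target CPU once, then issue a cheap syscall such as getpgid() that an fentry/kprobe/tp program could be attached to) might look roughly like the user-space sketch below. It is only an illustration under assumptions, not code from the patch set or from bench_trigger: the helper names first_allowed_cpu(), pin_to_cpu() and trigger_n() are made up here, no BPF program is loaded or attached, and the CPU is simply the first one in the caller's affinity mask.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: first CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Hypothetical helper: pin the calling thread to one CPU; 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means "the calling thread" */
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Hypothetical helper: issue n getpgid() syscalls from the current CPU.
 * A BPF program attached to the getpgid() path would run once per call.
 * Returns the number of calls that succeeded.
 */
static long trigger_n(long n)
{
	long ok = 0;

	for (long i = 0; i < n; i++)
		if (getpgid(0) >= 0)
			ok++;
	return ok;
}
```

A caller would typically create one thread per CPU, call pin_to_cpu() once at startup, and only then call trigger_n() whenever the program should run. That keeps sched_setaffinity() out of the measured loop, which is the comparison Andrii asks for above; it does not address the extra context switches on a busy target CPU.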