> On Aug 4, 2020, at 10:47 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Aug 4, 2020 at 9:47 PM Song Liu <songliubraving@xxxxxx> wrote:
>>
>>
>>
>>> On Aug 4, 2020, at 6:52 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>>>
>>> On Tue, Aug 4, 2020 at 2:01 PM Song Liu <songliubraving@xxxxxx> wrote:
>>>>
>>>>
>>>>
>>>>> On Aug 2, 2020, at 10:10 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>>>>>
>>>>> On Sun, Aug 2, 2020 at 9:47 PM Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>>> On Aug 2, 2020, at 6:51 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Sat, Aug 1, 2020 at 1:50 AM Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>>>
>>>>>>>> Add a benchmark to compare performance of
>>>>>>>> 1) uprobe;
>>>>>>>> 2) user program w/o args;
>>>>>>>> 3) user program w/ args;
>>>>>>>> 4) user program w/ args on random cpu.
>>>>>>>>
>>>>>>>
>>>>>>> Can you please add it to the existing benchmark runner instead, e.g.,
>>>>>>> along the other bench_trigger benchmarks? No need to re-implement
>>>>>>> benchmark setup. And also that would also allow to compare existing
>>>>>>> ways of cheaply triggering a program vs this new _USER program?
>>>>>>
>>>>>> Will try.
>>>>>>
>>>>>>>
>>>>>>> If the performance is not significantly better than other ways, do you
>>>>>>> think it still makes sense to add a new BPF program type? I think
>>>>>>> triggering KPROBE/TRACEPOINT from bpf_prog_test_run() would be very
>>>>>>> nice, maybe it's possible to add that instead of a new program type?
>>>>>>> Either way, let's see comparison with other program triggering
>>>>>>> mechanisms first.
>>>>>>
>>>>>> Triggering KPROBE and TRACEPOINT from bpf_prog_test_run() will be useful.
>>>>>> But I don't think they can be used instead of user program, for a couple
>>>>>> reasons. First, KPROBE/TRACEPOINT may be triggered by other programs
>>>>>> running in the system, so user will have to filter those noise out in
>>>>>> each program.
>>>>>> Second, it is not easy to specify CPU for KPROBE/TRACEPOINT,
>>>>>> while this feature could be useful in many cases, e.g. get stack trace
>>>>>> on a given CPU.
>>>>>>
>>>>>
>>>>> Right, it's not as convenient with KPROBE/TRACEPOINT as with the USER
>>>>> program you've added specifically with that feature in mind. But if
>>>>> you pin user-space thread on the needed CPU and trigger kprobe/tp,
>>>>> then you'll get what you want. As for the "noise", see how
>>>>> bench_trigger() deals with that: it records thread ID and filters
>>>>> everything not matching. You can do the same with CPU ID. It's not as
>>>>> automatic as with a special BPF program type, but still pretty simple,
>>>>> which is why I'm still deciding (for myself) whether USER program type
>>>>> is necessary :)
>>>>
>>>> Here are some bench_trigger numbers:
>>>>
>>>> base         : 1.698 ± 0.001M/s
>>>> tp           : 1.477 ± 0.001M/s
>>>> rawtp        : 1.567 ± 0.001M/s
>>>> kprobe       : 1.431 ± 0.000M/s
>>>> fentry       : 1.691 ± 0.000M/s
>>>> fmodret      : 1.654 ± 0.000M/s
>>>> user         : 1.253 ± 0.000M/s
>>>> fentry-on-cpu: 0.022 ± 0.011M/s
>>>> user-on-cpu  : 0.315 ± 0.001M/s
>>>>
>>>
>>> Ok, so basically all of raw_tp,tp,kprobe,fentry/fexit are
>>> significantly faster than USER programs. Sure, when compared to
>>> uprobe, they are faster, but not when doing on-specific-CPU run, it
>>> seems (judging from this patch's description, if I'm reading it
>>> right). Anyways, speed argument shouldn't be a reason for doing this,
>>> IMO.
>>>
>>>> The two "on-cpu" tests run the program on a different CPU (see the patch
>>>> at the end).
>>>>
>>>> "user" is about 25% slower than "fentry". I think this is mostly because
>>>> getpgid() is a faster syscall than bpf(BPF_TEST_RUN).
>>>
>>> Yes, probably.
>>>
>>>>
>>>> "user-on-cpu" is more than 10x faster than "fentry-on-cpu", because IPI
>>>> is way faster than moving the process (via sched_setaffinity).
>>>
>>> I don't think that's a good comparison, because you are actually
>>> testing sched_setaffinity performance on each iteration vs IPI in the
>>> kernel, not a BPF overhead.
>>>
>>> I think the fair comparison for this would be to create a thread and
>>> pin it on necessary CPU, and only then BPF program calls in a loop.
>>> But I bet any of existing program types would beat USER program.
>>>
>>>>
>>>> For use cases that we would like to call BPF program on specific CPU,
>>>> triggering it via IPI is a lot faster.
>>>
>>> So these use cases would be nice to expand on in the motivational part
>>> of the patch set. It's not really emphasized and it's not at all clear
>>> what you are trying to achieve. It also seems, depending on latency
>>> requirements, it's totally possible to achieve comparable results by
>>> pre-creating a thread for each CPU, pinning each one to its designated
>>> CPU and then using any suitable user-space signaling mechanism (a
>>> queue, condvar, etc) to ask a thread to trigger BPF program (fentry on
>>> getpgid(), for instance).
>>
>> I don't see why user space signal plus fentry would be faster than IPI.
>> If the target cpu is running something, this gonna add two context
>> switches.
>>
>
> I didn't say faster, did I? I said it would be comparable and wouldn't
> require a new program type.

Well, I don't think adding a program type is that big a deal. If that is
really a problem, we can use a new attach type instead. The goal is to
trigger it with sys_bpf() on a different cpu. So we can call it kprobe
attach to nothing and hack that way. I added the new type because it makes
sense. The user just wants to trigger a BPF program from user space.

> But then again, without knowing all the
> details, it's a bit hard to discuss this. E.g., if you need to trigger
> that BPF program periodically, you can sleep in those per-CPU threads,
> or epoll, or whatever.
> Or maybe you can set up a per-CPU perf event
> that would trigger your program on the desired CPU, etc. My point is
> that I and others shouldn't be guessing this, I'd expect someone who's
> proposing an entire new BPF program type to motivate why this new
> program type is necessary and what problem it's solving that can't be
> solved with existing means.

Yes, there are other options. But they all come with non-trivial cost.
Per-CPU-per-process threads and/or per-CPU perf events are costs we have
to pay in production. IMO, these costs are much higher than a new program
type (or attach type).

>
> BTW, how frequently do you need to trigger the BPF program? Seems very
> frequently, if 2 vs 1 context switches might be a problem?

The whole solution requires two BPF programs. One runs on each context
switch; the other is the user program. The user program will not trigger
very often.

>
>>> I bet in this case the performance would be
>>> really nice for a lot of practical use cases. But then again, I don't
>>> know details of the intended use case, so please provide some more
>>> details.
>>
>> Being able to trigger BPF program on a different CPU could enable many
>> use cases and optimizations. The use case I am looking at is to access
>> perf_event and percpu maps on the target CPU. For example:
>> 0. trigger the program
>> 1. read perf_event on cpu x;
>> 2. (optional) check which process is running on cpu x;
>> 3. add perf_event value to percpu map(s) on cpu x.
>>
>> If we do these steps in a BPF program on cpu x, the cost is:
>> A.0) trigger BPF via IPI;
>> A.1) read perf_event locally;
>> A.2) local access current;
>> A.3) local access of percpu map(s).
>>
>> If we can only do these on a different CPU, the cost will be:
>> B.0) trigger BPF locally;
>> B.1) read perf_event via IPI;
>> B.2) remote access current on cpu x;
>> B.3) remote access percpu map(s), or use non-percpu map(s).
>>
>> Cost of (A.0 + A.1) is about same as (B.0 + B.1), maybe a little higher
>> (sys_bpf(), vs. sys_getpgid()). But A.2 and A.3 will be significantly
>> cheaper than B.2 and B.3.
>>
>> Does this make sense?
>
> It does, thanks. But what I was describing is still A, no? BPF program
> will be triggered on your desired cpu X, wouldn't it?

Well, that would be option C, but C could not do step 2, because we
context switch to the dedicated thread.
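
For reference, here is a minimal user-space sketch of the "pin once, then
trigger in a loop" setup suggested earlier in the thread, as opposed to
calling sched_setaffinity() on every iteration. It assumes an fentry
program is already attached on the getpgid() syscall path (loading and
attaching are not shown). pin_to_cpu() and trigger_loop() are hypothetical
names used only for illustration; they are not part of the patch or of
bench_trigger.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: pin the calling thread to `cpu` once, up front. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means "the calling thread" */
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Hypothetical helper: issue `iters` cheap syscalls from the pinned thread.
 * Each getpgid() call is where an attached fentry program (assumed, not
 * shown) would run. Returns the number of calls that succeeded.
 */
static long trigger_loop(long iters)
{
	long ok = 0;

	assert(iters >= 0);
	for (long i = 0; i < iters; i++)
		if (getpgid(0) >= 0)
			ok++;
	return ok;
}
```

With this shape the affinity change is paid once per thread, so the loop
measures only the syscall plus BPF overhead. As noted above, the BPF
program then runs in the context of the dedicated thread, so it cannot see
what would otherwise have been running on that CPU (step 2).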