On Wed, Dec 15, 2021 at 10:54 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote: > > On 12/15/21 18:24, sdf@xxxxxxxxxx wrote: > > On 12/15, Pavel Begunkov wrote: > >> On 12/15/21 17:33, sdf@xxxxxxxxxx wrote: > >> > On 12/15, Pavel Begunkov wrote: > >> > > On 12/15/21 16:51, sdf@xxxxxxxxxx wrote: > >> > > > On 12/15, Pavel Begunkov wrote: > >> > > > > � /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */ > >> > > > > � #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)����������������� \ > >> > > > > � ({����������������������������������������� \ > >> > > > > ����� int __ret = 0;��������������������������������� \ > >> > > > > -��� if (cgroup_bpf_enabled(CGROUP_INET_INGRESS))������������� \ > >> > > > > +��� if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) && sk &&������������� \ > >> > > > > +������� CGROUP_BPF_TYPE_ENABLED((sk), CGROUP_INET_INGRESS))���������� \ > >> > > > > >> > > > Why not add this __cgroup_bpf_run_filter_skb check to > >> > > > __cgroup_bpf_run_filter_skb? Result of sock_cgroup_ptr() is already there > >> > > > and you can use it. Maybe move the things around if you want > >> > > > it to happen earlier. > >> > > >> > > For inlining. Just wanted to get it done right, otherwise I'll likely be > >> > > returning to it back in a few months complaining that I see measurable > >> > > overhead from the function call :) > >> > > >> > Do you expect that direct call to bring any visible overhead? > >> > Would be nice to compare that inlined case vs > >> > __cgroup_bpf_prog_array_is_empty inside of __cgroup_bpf_run_filter_skb > >> > while you're at it (plus move offset initialization down?). > > > >> Sorry but that would be waste of time. I naively hope it will be visible > >> with net at some moment (if not already), that's how it was with io_uring, > >> that's what I see in the block layer. And in anyway, if just one inlined > >> won't make a difference, then 10 will. > > > > I can probably do more experiments on my side once your patch is > > accepted. I'm mostly concerned with getsockopt(TCP_ZEROCOPY_RECEIVE). > > If you claim there is visible overhead for a direct call then there > > should be visible benefit to using CGROUP_BPF_TYPE_ENABLED there as > > well. > > Interesting, sounds getsockopt might be performance sensitive to > someone. > > FWIW, I forgot to mention that for testing tx I'm using io_uring > (for both zc and not) with good submission batching. Yeah, last time I saw 2-3% as well, but it was due to kmalloc, see more details in 9cacf81f8161, it was pretty visible under perf. That's why I'm a bit skeptical of your claims of direct calls being somehow visible in these 2-3% (even skb pulls/pushes are not 2-3%?). But tbf I don't understand how it all plays out with the io_uring. (mostly trying to understand where there is some gain left on the table for TCP_ZEROCOPY_RECEIVE).