On Wed, Dec 15, 2021 at 11:55 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
>
> On 12/15/21 19:15, Stanislav Fomichev wrote:
> > On Wed, Dec 15, 2021 at 10:54 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
> >>
> >> On 12/15/21 18:24, sdf@xxxxxxxxxx wrote:
> >>> On 12/15, Pavel Begunkov wrote:
> >>>> On 12/15/21 17:33, sdf@xxxxxxxxxx wrote:
> >>>>> On 12/15, Pavel Begunkov wrote:
> >>>>>> On 12/15/21 16:51, sdf@xxxxxxxxxx wrote:
> >>>>>>> On 12/15, Pavel Begunkov wrote:
> >>>>>>>>   /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
> >>>>>>>>   #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)                  \
> >>>>>>>>   ({                                                                 \
> >>>>>>>>       int __ret = 0;                                                 \
> >>>>>>>> -    if (cgroup_bpf_enabled(CGROUP_INET_INGRESS))                    \
> >>>>>>>> +    if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) && sk &&            \
> >>>>>>>> +        CGROUP_BPF_TYPE_ENABLED((sk), CGROUP_INET_INGRESS))         \
> >>>>>>>
> >>>>>>> Why not add this __cgroup_bpf_run_filter_skb check to
> >>>>>>> __cgroup_bpf_run_filter_skb? Result of sock_cgroup_ptr() is already there
> >>>>>>> and you can use it. Maybe move the things around if you want
> >>>>>>> it to happen earlier.
> >>>>>
> >>>>>> For inlining. Just wanted to get it done right, otherwise I'll likely be
> >>>>>> returning to it back in a few months complaining that I see measurable
> >>>>>> overhead from the function call :)
> >>>>>
> >>>>> Do you expect that direct call to bring any visible overhead?
> >>>>> Would be nice to compare that inlined case vs
> >>>>> __cgroup_bpf_prog_array_is_empty inside of __cgroup_bpf_run_filter_skb
> >>>>> while you're at it (plus move offset initialization down?).
> >>>
> >>>> Sorry but that would be waste of time. I naively hope it will be visible
> >>>> with net at some moment (if not already), that's how it was with io_uring,
> >>>> that's what I see in the block layer. And anyway, if just one inlined
> >>>> won't make a difference, then 10 will.
> >>>
> >>> I can probably do more experiments on my side once your patch is
> >>> accepted. I'm mostly concerned with getsockopt(TCP_ZEROCOPY_RECEIVE).
> >>> If you claim there is visible overhead for a direct call then there
> >>> should be visible benefit to using CGROUP_BPF_TYPE_ENABLED there as
> >>> well.
> >>
> >> Interesting, sounds getsockopt might be performance sensitive to
> >> someone.
> >>
> >> FWIW, I forgot to mention that for testing tx I'm using io_uring
> >> (for both zc and not) with good submission batching.
> >
> > Yeah, last time I saw 2-3% as well, but it was due to kmalloc, see
> > more details in 9cacf81f8161, it was pretty visible under perf.
> > That's why I'm a bit skeptical of your claims of direct calls being
> > somehow visible in these 2-3% (even skb pulls/pushes are not 2-3%?).
>
> migrate_disable/enable together were taking somewhat in-between
> 1% and 1.5% in profiling, don't remember the exact number. The rest
> should be from rcu_read_lock/unlock() in BPF_PROG_RUN_ARRAY_CG_FLAGS()
> and other extra bits on the way.

You probably have a preemptible kernel and preemptible rcu which most
likely explains why you see the overhead and I won't (non-preemptible
kernel in our env, rcu_read_lock is essentially a nop, just a compiler
barrier).

> I'm skeptical I'll be able to measure inlining one function,
> variability between boots/runs is usually greater and would hide it.

Right, that's why I suggested to mirror what we do in set/getsockopt
instead of the new extra CGROUP_BPF_TYPE_ENABLED. But I'll leave it up
to you, Martin and the rest.
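
For anyone following the inlining argument, here is a minimal userspace
sketch in C of the two shapes being compared: a wrapper that inlines only
the global check, and one that also inlines a per-socket "anything
attached?" check before the out-of-line call. This is not kernel code.
Every name in it (fake_sock, fake_cgroup, type_enabled_globally,
run_filter_slow and so on) is hypothetical and only models the pattern;
the noinline attribute assumes GCC or Clang.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum attach_type { ATTACH_INET_INGRESS, ATTACH_INET_EGRESS, ATTACH_MAX };

struct prog_array {
	size_t nr_progs;		/* 0 means nothing is attached */
};

struct fake_cgroup {
	struct prog_array effective[ATTACH_MAX];
};

struct fake_sock {
	struct fake_cgroup *cgrp;
};

/* Stand-in for the global per-type check (a static key in the thread). */
static bool type_enabled_globally[ATTACH_MAX];

/* Counts how often the out-of-line path is actually entered. */
static unsigned long slow_path_calls;

/* Out-of-line path: the function call whose cost is being debated. */
__attribute__((noinline))
static int run_filter_slow(struct fake_sock *sk, enum attach_type type)
{
	(void)sk;
	(void)type;
	slow_path_calls++;
	/* A real filter would run every attached program here. */
	return 0;
}

/* Inline per-socket check: is anything attached for this type? */
static inline bool type_enabled_for_sock(const struct fake_sock *sk,
					 enum attach_type type)
{
	return sk->cgrp->effective[type].nr_progs != 0;
}

/* Variant A: only the global check is inlined. */
static inline int run_ingress_global_only(struct fake_sock *sk)
{
	int ret = 0;

	if (type_enabled_globally[ATTACH_INET_INGRESS])
		ret = run_filter_slow(sk, ATTACH_INET_INGRESS);
	return ret;
}

/* Variant B: global and per-socket checks are both inlined. */
static inline int run_ingress_inline_check(struct fake_sock *sk)
{
	int ret = 0;

	if (type_enabled_globally[ATTACH_INET_INGRESS] && sk &&
	    type_enabled_for_sock(sk, ATTACH_INET_INGRESS))
		ret = run_filter_slow(sk, ATTACH_INET_INGRESS);
	return ret;
}
```

With the type enabled globally but nothing attached to the socket's
cgroup, variant A still pays for the call while variant B returns without
leaving the caller. The sketch only shows where the call is avoided; it
does not show whether the saving is measurable, which is the open
question above.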