On 12/15/21 19:15, Stanislav Fomichev wrote:
On Wed, Dec 15, 2021 at 10:54 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
On 12/15/21 18:24, sdf@xxxxxxxxxx wrote:
On 12/15, Pavel Begunkov wrote:
On 12/15/21 17:33, sdf@xxxxxxxxxx wrote:
On 12/15, Pavel Begunkov wrote:
On 12/15/21 16:51, sdf@xxxxxxxxxx wrote:
On 12/15, Pavel Begunkov wrote:
� /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
� #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)����������������� \
� ({����������������������������������������� \
����� int __ret = 0;��������������������������������� \
-��� if (cgroup_bpf_enabled(CGROUP_INET_INGRESS))������������� \
+��� if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) && sk &&������������� \
+������� CGROUP_BPF_TYPE_ENABLED((sk), CGROUP_INET_INGRESS))���������� \
Why not add this __cgroup_bpf_run_filter_skb check to
__cgroup_bpf_run_filter_skb? Result of sock_cgroup_ptr() is already there
and you can use it. Maybe move the things around if you want
it to happen earlier.
For inlining. Just wanted to get it done right, otherwise I'll likely be
returning to it back in a few months complaining that I see measurable
overhead from the function call :)
Do you expect that direct call to bring any visible overhead?
Would be nice to compare that inlined case vs
__cgroup_bpf_prog_array_is_empty inside of __cgroup_bpf_run_filter_skb
while you're at it (plus move offset initialization down?).
Sorry but that would be waste of time. I naively hope it will be visible
with net at some moment (if not already), that's how it was with io_uring,
that's what I see in the block layer. And in anyway, if just one inlined
won't make a difference, then 10 will.
I can probably do more experiments on my side once your patch is
accepted. I'm mostly concerned with getsockopt(TCP_ZEROCOPY_RECEIVE).
If you claim there is visible overhead for a direct call then there
should be visible benefit to using CGROUP_BPF_TYPE_ENABLED there as
well.
Interesting, sounds getsockopt might be performance sensitive to
someone.
FWIW, I forgot to mention that for testing tx I'm using io_uring
(for both zc and not) with good submission batching.
Yeah, last time I saw 2-3% as well, but it was due to kmalloc, see
more details in 9cacf81f8161, it was pretty visible under perf.
That's why I'm a bit skeptical of your claims of direct calls being
somehow visible in these 2-3% (even skb pulls/pushes are not 2-3%?).
migrate_disable/enable together were taking somewhat in-between
1% and 1.5% in profiling, don't remember the exact number. The rest
should be from rcu_read_lock/unlock() in BPF_PROG_RUN_ARRAY_CG_FLAGS()
and other extra bits on the way.
I'm skeptical I'll be able to measure inlining one function,
variability between boots/runs is usually greater and would hide it.
But tbf I don't understand how it all plays out with the io_uring.
1 syscall per N requests (N=32 IIRC), 1 fdget() per N, no payload
page referencing (for zc), and so on
(mostly trying to understand where there is some gain left on the
table for TCP_ZEROCOPY_RECEIVE).
--
Pavel Begunkov