On Mon, Jan 4, 2021 at 4:03 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
>
> On Mon, Jan 04, 2021 at 02:14:53PM -0800, Stanislav Fomichev wrote:
> > When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> > syscall starts incurring kzalloc/kfree cost. While, in general, it's
> > not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> > TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> > fastpath for incoming TCP, we don't want to have extra allocations in
> > there.
> >
> > Let add a small buffer on the stack and use it for small (majority)
> > {s,g}etsockopt values. I've started with 128 bytes to cover
> > the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> > currently, with some planned extension to 64).
> >
> > It seems natural to do the same for setsockopt, but it's a bit more
> > involved when the BPF program modifies the data (where we have to
> > kmalloc). The assumption is that for the majority of setsockopt
> > calls (which are doing pure BPF options or apply policy) this
> > will bring some benefit as well.
> >
> > Collected some performance numbers using (on a 65k MTU localhost in a VM):
> > $ perf record -g -- ./tcp_mmap -s -z
> > $ ./tcp_mmap -H ::1 -z
> > $ ...
> > $ perf report --symbol-filter=__cgroup_bpf_run_filter_getsockopt
> >
> > Without this patch:
> >      4.81%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_>
> >             |
> >              --4.74%--__cgroup_bpf_run_filter_getsockopt
> >                        |
> >                        |--1.06%--__kmalloc
> >                        |
> >                        |--0.71%--lock_sock_nested
> >                        |
> >                        |--0.62%--__might_fault
> >                        |
> >                         --0.52%--release_sock
> >
> > With the patch applied:
> >      3.29%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
> >             |
> >              --3.22%--__cgroup_bpf_run_filter_getsockopt
> >                        |
> >                        |--0.66%--lock_sock_nested
> >                        |
> >                        |--0.57%--__might_fault
> >                        |
> >                         --0.56%--release_sock
> >
> > So it saves about 1% of the system call. Unfortunately, we still get
> > 2-3% of overhead due to another socket lock/unlock :-(
> That could be a future exercise to optimize the fast path sockopts. ;)
Yeah, I couldn't think about anything simple so far. The only idea I have
is to allow custom implementation for tcp/udp (where we do lock_sock)
and then have existing BPF_CGROUP_RUN_PROG_{S,G}ETSOCKOPT in net/socket.c
as a fallback. Need to experiment more with it.

> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/bpf-cgroup.h>
> >  #include <net/sock.h>
> >  #include <net/bpf_sk_storage.h>
> > +#include <net/tcp.h> /* sizeof(struct tcp_zerocopy_receive) */
> To be more specific, it should be <uapi/linux/tcp.h>.
Sure, let's do that. I went with net/tcp.h because most of the code
under net/* doesn't include uapi directly.

> >
> >  #include "../cgroup/cgroup-internal.h"
> >
> > @@ -1298,6 +1299,7 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
> >  	return empty;
> >  }
> >
> > +
> Extra newline.
Oops, thanks, will fix.

> >  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> >  {
> >  	if (unlikely(max_optlen < 0))
> > @@ -1310,6 +1312,18 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> >  		max_optlen = PAGE_SIZE;
> >  	}
> >
> > +	if (max_optlen <= sizeof(ctx->buf)) {
> > +		/* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
> > +		 * bytes avoid the cost of kzalloc.
> > +		 */
> If it needs to respin, it will be good to have a few words here on why
> it only BUILD_BUG checks for "struct tcp_zerocopy_receive".
Sounds good, will add. I'll wait a day to let others comment and will
respin.

> > +		BUILD_BUG_ON(sizeof(struct tcp_zerocopy_receive) >
> > +			     BPF_SOCKOPT_KERN_BUF_SIZE);
> > +
> > +		ctx->optval = ctx->buf;
> > +		ctx->optval_end = ctx->optval + max_optlen;
> > +		return max_optlen;
> > +	}
> > +