On Fri, Jan 8, 2021 at 5:37 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
>
> On Fri, Jan 08, 2021 at 01:02:21PM -0800, Stanislav Fomichev wrote:
> > Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> > We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> > call in do_tcp_getsockopt using the on-stack data. This removes
> > 3% overhead for locking/unlocking the socket.
> >
> > Without this patch:
> >      3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
> >             |
> >              --3.30%--__cgroup_bpf_run_filter_getsockopt
> >                        |
> >                         --0.81%--__kmalloc
> >
> > With the patch applied:
> >      0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
> >
> > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> > Cc: Martin KaFai Lau <kafai@xxxxxx>
> > Cc: Song Liu <songliubraving@xxxxxx>
> > Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > ---
> >  include/linux/bpf-cgroup.h                    | 27 +++++++++++--
> >  include/linux/indirect_call_wrapper.h         |  6 +++
> >  include/net/sock.h                            |  2 +
> >  include/net/tcp.h                             |  1 +
> >  kernel/bpf/cgroup.c                           | 38 +++++++++++++++++++
> >  net/ipv4/tcp.c                                | 14 +++++++
> >  net/ipv4/tcp_ipv4.c                           |  1 +
> >  net/ipv6/tcp_ipv6.c                           |  1 +
> >  net/socket.c                                  |  3 ++
> >  .../selftests/bpf/prog_tests/sockopt_sk.c     | 22 +++++++++++
> >  .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 ++++++++
> >  11 files changed, 126 insertions(+), 4 deletions(-)
> >
> [ ... ]
>
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index 6ec088a96302..c41bb2f34013 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -1485,6 +1485,44 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> >  	sockopt_free_buf(&ctx);
> >  	return ret;
> >  }
> > +
> > +int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
> > +					    int optname, void *optval,
> > +					    int *optlen, int retval)
> > +{
> > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +	struct bpf_sockopt_kern ctx = {
> > +		.sk = sk,
> > +		.level = level,
> > +		.optname = optname,
> > +		.retval = retval,
> > +		.optlen = *optlen,
> > +		.optval = optval,
> > +		.optval_end = optval + *optlen,
> > +	};
> > +	int ret;
> > +
> The current behavior only passes kernel optval to bpf prog when
> retval == 0.  Can you explain a few words here about
> the difference and why it is fine?
> Just in case some other options may want to reuse the
> __cgroup_bpf_run_filter_getsockopt_kern() in the future.
IIRC, skipping the copy for retval != 0 in
__cgroup_bpf_run_filter_getsockopt is just an optimization.
I was assuming that on error the kernel wouldn't copy anything
back to the user (not sure how true that is in real life).
I'll add a comment here to signify the difference.

> > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > +				 &ctx, BPF_PROG_RUN);
> > +	if (!ret)
> > +		return -EPERM;
> > +
> > +	if (ctx.optlen > *optlen)
> > +		return -EFAULT;
> > +
> > +	/* BPF programs only allowed to set retval to 0, not some
> > +	 * arbitrary value.
> > +	 */
> > +	if (ctx.retval != 0 && ctx.retval != retval)
> > +		return -EFAULT;
> > +
> > +	/* BPF programs can shrink the buffer, export the modifications.
> > +	 */
> > +	if (ctx.optlen != 0)
> > +		*optlen = ctx.optlen;
> > +
> > +	return ctx.retval;
> > +}
> >  #endif
> >
> >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
>
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> > index b25c9c45c148..6bb18b1d8578 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> > @@ -11,6 +11,7 @@ static int getsetsockopt(void)
> >  		char u8[4];
> >  		__u32 u32;
> >  		char cc[16]; /* TCP_CA_NAME_MAX */
> > +		struct tcp_zerocopy_receive zc;
> I suspect it won't compile at least in my setup.
>
> However, I compile tools/testing/selftests/net/tcp_mmap.c fine though.
> I _guess_ it is because the net's test has included kernel/usr/include.
>
> AFAIK, bpf's tests use tools/include/uapi/.
>
> Others LGTM.
Sure, let me export it to tools/include/uapi. I didn't do it
because it also compiled for me and I assumed that tcp_zerocopy_receive
was exported too long ago to care (we are using the first field anyway,
so we don't really need the latest layout).
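
For anyone following along, the checks that run after the BPF program in
the quoted hunk can be modeled in userspace roughly as below. This is only
a sketch for illustration, not kernel code: check_getsockopt_result() and
its prog_ok, ctx_optlen and ctx_retval parameters are hypothetical names
standing in for the BPF_PROG_RUN_ARRAY result and the ctx fields the
program may have modified.

```c
#include <errno.h>

/* Hypothetical userspace model of the post-run checks in the quoted
 * __cgroup_bpf_run_filter_getsockopt_kern() hunk. Nothing here exists
 * in the kernel under these names.
 *
 * prog_ok    - stands in for the BPF_PROG_RUN_ARRAY() result
 * ctx_optlen - stands in for ctx.optlen after the program ran
 * ctx_retval - stands in for ctx.retval after the program ran
 * optlen     - in/out length, as in the quoted function
 * retval     - return value the kernel handler produced
 */
static int check_getsockopt_result(int prog_ok, int ctx_optlen,
				   int ctx_retval, int *optlen, int retval)
{
	/* A program rejecting the call maps to -EPERM. */
	if (!prog_ok)
		return -EPERM;

	/* The program may shrink the buffer but must not grow it. */
	if (ctx_optlen > *optlen)
		return -EFAULT;

	/* The program may reset retval to 0 or leave it untouched;
	 * any other value is rejected.
	 */
	if (ctx_retval != 0 && ctx_retval != retval)
		return -EFAULT;

	/* Export a modified length; 0 leaves *optlen as it was. */
	if (ctx_optlen != 0)
		*optlen = ctx_optlen;

	return ctx_retval;
}
```

With *optlen == 16, a call such as check_getsockopt_result(1, 8, 0, &optlen, 0)
models a program that shrinks the buffer to 8 bytes, while passing 32 as
ctx_optlen models a program that tries to grow it and gets -EFAULT.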