Re: [PATCH bpf-next v2 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

Martin KaFai Lau <kafai@xxxxxx> · Mon, 4 Jan 2021 16:03:29 -0800



On Mon, Jan 04, 2021 at 02:14:53PM -0800, Stanislav Fomichev wrote:
> When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> syscall starts incurring kzalloc/kfree cost. While, in general, it's
> not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> TCP_ZEROCOPY_RECEIVE (ab)uses the getsockopt system call to implement a
> fastpath for incoming TCP, so we don't want to have extra allocations in
> there.
> 
> Let's add a small buffer on the stack and use it for small (majority)
> {s,g}etsockopt values. I've started with 128 bytes to cover
> the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> currently, with some planned extension to 64).
> 
> It seems natural to do the same for setsockopt, but it's a bit more
> involved when the BPF program modifies the data (where we have to
> kmalloc). The assumption is that for the majority of setsockopt
> calls (which are doing pure BPF options or apply policy) this
> will bring some benefit as well.
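
A minimal userspace sketch of the idea described above, not the kernel code
itself: small option values are served from a buffer embedded in the context,
and only larger ones fall back to a heap allocation. All names here
(sockopt_ctx, SOCKOPT_BUF_SIZE, sockopt_ctx_alloc, ...) are hypothetical, the
128-byte size is only illustrative, and calloc/free stand in for
kzalloc/kfree.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative size only; the real constant lives in the kernel patch. */
#define SOCKOPT_BUF_SIZE 128

/* Hypothetical stand-in for the kernel's sockopt context. */
struct sockopt_ctx {
	uint8_t *optval;      /* start of the value buffer */
	uint8_t *optval_end;  /* one past the last usable byte */
	uint8_t buf[SOCKOPT_BUF_SIZE]; /* embedded buffer for small values */
};

/* Returns the usable length, or -1 on a bad length or allocation failure. */
static int sockopt_ctx_alloc(struct sockopt_ctx *ctx, int max_optlen)
{
	assert(ctx != NULL);

	if (max_optlen < 0)
		return -1;

	if ((size_t)max_optlen <= sizeof(ctx->buf)) {
		/* Small value: no heap allocation needed. */
		memset(ctx->buf, 0, sizeof(ctx->buf));
		ctx->optval = ctx->buf;
	} else {
		/* Large value: zeroed heap allocation (kzalloc stand-in). */
		ctx->optval = calloc(1, (size_t)max_optlen);
		if (!ctx->optval)
			return -1;
	}

	ctx->optval_end = ctx->optval + max_optlen;
	return max_optlen;
}

/* True when optval points at heap memory rather than the embedded buffer. */
static int sockopt_ctx_uses_heap(const struct sockopt_ctx *ctx)
{
	return ctx->optval != ctx->buf;
}

/* Only heap-backed buffers may be freed; the embedded one must not be. */
static void sockopt_ctx_free(struct sockopt_ctx *ctx)
{
	if (sockopt_ctx_uses_heap(ctx))
		free(ctx->optval);
	ctx->optval = NULL;
	ctx->optval_end = NULL;
}
```

The free path is the part that is easy to get wrong: once optval may point
at the embedded buffer, every release site has to check which case it is in
before freeing.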
> 
> Collected some performance numbers using (on a 65k MTU localhost in a VM):
> $ perf record -g -- ./tcp_mmap -s -z
> $ ./tcp_mmap -H ::1 -z
> $ ...
> $ perf report --symbol-filter=__cgroup_bpf_run_filter_getsockopt
> 
> Without this patch:
>      4.81%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_>
>             |
>              --4.74%--__cgroup_bpf_run_filter_getsockopt
>                        |
>                        |--1.06%--__kmalloc
>                        |
>                        |--0.71%--lock_sock_nested
>                        |
>                        |--0.62%--__might_fault
>                        |
>                         --0.52%--release_sock
> 
> With the patch applied:
>      3.29%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
>             |
>              --3.22%--__cgroup_bpf_run_filter_getsockopt
>                        |
>                        |--0.66%--lock_sock_nested
>                        |
>                        |--0.57%--__might_fault
>                        |
>                         --0.56%--release_sock
> 
> So it saves about 1% of the system call. Unfortunately, we still get
> 2-3% of overhead due to another socket lock/unlock :-(
That could be a future exercise to optimize the fast path sockopts. ;)

[ ... ]

> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -16,6 +16,7 @@
>  #include <linux/bpf-cgroup.h>
>  #include <net/sock.h>
>  #include <net/bpf_sk_storage.h>
> +#include <net/tcp.h> /* sizeof(struct tcp_zerocopy_receive) */
To be more specific, it should be <uapi/linux/tcp.h>.

>  
>  #include "../cgroup/cgroup-internal.h"
>  
> @@ -1298,6 +1299,7 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
>  	return empty;
>  }
>  
> +
Extra newline.

>  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
>  {
>  	if (unlikely(max_optlen < 0))
> @@ -1310,6 +1312,18 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
>  		max_optlen = PAGE_SIZE;
>  	}
>  
> +	if (max_optlen <= sizeof(ctx->buf)) {
> +		/* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
> +		 * bytes avoid the cost of kzalloc.
> +		 */
If this needs a respin, it would be good to add a few words here on why
BUILD_BUG_ON only checks "struct tcp_zerocopy_receive".
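
For illustration, a userspace analogue of such a compile-time size check,
with the kind of comment being asked for. This is a sketch under
assumptions: C11 static_assert stands in for BUILD_BUG_ON, and
FAST_BUF_SIZE and struct fastpath_opt are hypothetical stand-ins, not the
kernel's BPF_SOCKOPT_KERN_BUF_SIZE or the real struct tcp_zerocopy_receive
layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size of the embedded fast-path buffer. */
#define FAST_BUF_SIZE 64

/* Hypothetical stand-in for a performance-sensitive option struct. */
struct fastpath_opt {
	uint64_t address;
	uint32_t length;
	uint32_t hint;
	uint64_t reserved[2];
};

/*
 * Only this option is checked at build time: it is the one whose fast
 * path depends on avoiding the allocation. Other options merely fall
 * back to the heap if they outgrow the buffer, which is slower but
 * still correct.
 */
static_assert(sizeof(struct fastpath_opt) <= FAST_BUF_SIZE,
	      "fastpath_opt no longer fits the embedded buffer");

/* Runtime counterpart of the check for arbitrary option lengths. */
static int fits_fast_buf(size_t optlen)
{
	return optlen <= FAST_BUF_SIZE;
}
```

If the struct later grows past the buffer size, the build fails instead of
the fast path silently regressing to an allocation.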

> +		BUILD_BUG_ON(sizeof(struct tcp_zerocopy_receive) >
> +			     BPF_SOCKOPT_KERN_BUF_SIZE);
> +
> +		ctx->optval = ctx->buf;
> +		ctx->optval_end = ctx->optval + max_optlen;
> +		return max_optlen;
> +	}
> +