Re: [PATCH bpf-next 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

Song Liu <song@xxxxxxxxxx> · Mon, 21 Dec 2020 14:22:41 -0800

On Thu, Dec 17, 2020 at 9:24 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> syscall starts incurring kzalloc/kfree cost. While, in general, it's
> not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> fastpath for incoming TCP, we don't want to have extra allocations in
> there.
>
> Let add a small buffer on the stack and use it for small (majority)
> {s,g}etsockopt values. I've started with 128 bytes to cover
> the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> currently, with some planned extension to 64 + some headroom
> for the future).

I don't really know the rule of thumb, but 128 bytes on stack feels too big to
me. I would like to hear others' opinions on this. Can we solve the problem
with some other mechanisms, e.g. a mempool?

[...]

>
> +static void *sockopt_export_buf(struct bpf_sockopt_kern *ctx)
> +{
> +       void *p;
> +
> +       if (ctx->optval != ctx->buf)
> +               return ctx->optval;
> +
> +       /* We've used bpf_sockopt_kern->buf as an intermediary storage,
> +        * but the BPF program indicates that we need to pass this
> +        * data to the kernel setsockopt handler. No way to export
> +        * on-stack buf, have to allocate a new buffer. The caller
> +        * is responsible for the kfree().
> +        */
> +       p = kzalloc(ctx->optlen, GFP_USER);
> +       if (!p)
> +               return ERR_PTR(-ENOMEM);
> +       memcpy(p, ctx->optval, ctx->optlen);
> +       return p;
> +}
> +
>  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
>                                        int *optname, char __user *optval,
>                                        int *optlen, char **kernel_optval)
> @@ -1389,8 +1420,14 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
>                  * use original userspace data.
>                  */
>                 if (ctx.optlen != 0) {
> -                       *optlen = ctx.optlen;
> -                       *kernel_optval = ctx.optval;
> +                       void *buf = sockopt_export_buf(&ctx);

I found it is hard to follow the logic here (when to allocate memory, how to
fail over, etc.). Do we have plan to reuse sockopt_export_buf()? If not, it is
probably cleaner to put the logic in __cgroup_bpf_run_filter_setsockopt()?

Thanks,
Song

[...]