On Wed, Jun 17, 2020 at 10:45:08AM -0700, sdf@xxxxxxxxxx wrote: > On 06/17, Alexei Starovoitov wrote: > > On Tue, Jun 16, 2020 at 06:04:14PM -0700, Stanislav Fomichev wrote: > > > Attaching to these hooks can break iptables because its optval is > > > usually quite big, or at least bigger than the current PAGE_SIZE limit. > > > David also mentioned some SCTP options can be big (around 256k). > > > > > > For such optvals we expose only the first PAGE_SIZE bytes to > > > the BPF program. BPF program has two options: > > > 1. Set ctx->optlen to 0 to indicate that the BPF's optval > > > should be ignored and the kernel should use original userspace > > > value. > > > 2. Set ctx->optlen to something that's smaller than the PAGE_SIZE. > > > > > > v5: > > > * use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov) > > > * update the docs accordingly > > > > > > v4: > > > * use temporary buffer to avoid optval == optval_end == NULL; > > > this removes the corner case in the verifier that might assume > > > non-zero PTR_TO_PACKET/PTR_TO_PACKET_END. > > > > > > v3: > > > * don't increase the limit, bypass the argument > > > > > > v2: > > > * proper comments formatting (Jakub Kicinski) > > > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > Cc: David Laight <David.Laight@xxxxxxxxxx> > > > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx> > > > --- > > > kernel/bpf/cgroup.c | 53 ++++++++++++++++++++++++++++----------------- > > > 1 file changed, 33 insertions(+), 20 deletions(-) > > > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > > index 4d76f16524cc..ac53102e244a 100644 > > > --- a/kernel/bpf/cgroup.c > > > +++ b/kernel/bpf/cgroup.c > > > @@ -1276,16 +1276,23 @@ static bool > > __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, > > > > > > static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int > > max_optlen) > > > { > > > - if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0) > > > + if (unlikely(max_optlen < 0)) > > > return -EINVAL; > > > > > > + if (unlikely(max_optlen > PAGE_SIZE)) { > > > + /* We don't expose optvals that are greater than PAGE_SIZE > > > + * to the BPF program. > > > + */ > > > + max_optlen = PAGE_SIZE; > > > + } > > > + > > > ctx->optval = kzalloc(max_optlen, GFP_USER); > > > if (!ctx->optval) > > > return -ENOMEM; > > > > > > ctx->optval_end = ctx->optval + max_optlen; > > > > > > - return 0; > > > + return max_optlen; > > > } > > > > > > static void sockopt_free_buf(struct bpf_sockopt_kern *ctx) > > > @@ -1319,13 +1326,13 @@ int __cgroup_bpf_run_filter_setsockopt(struct > > sock *sk, int *level, > > > */ > > > max_optlen = max_t(int, 16, *optlen); > > > > > > - ret = sockopt_alloc_buf(&ctx, max_optlen); > > > - if (ret) > > > - return ret; > > > + max_optlen = sockopt_alloc_buf(&ctx, max_optlen); > > > + if (max_optlen < 0) > > > + return max_optlen; > > > > > > ctx.optlen = *optlen; > > > > > > - if (copy_from_user(ctx.optval, optval, *optlen) != 0) { > > > + if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != > > 0) { > > > ret = -EFAULT; > > > goto out; > > > } > > > @@ -1353,8 +1360,14 @@ int __cgroup_bpf_run_filter_setsockopt(struct > > sock *sk, int *level, > > > /* export any potential modifications */ > > > *level = ctx.level; > > > *optname = ctx.optname; > > > - *optlen = ctx.optlen; > > > - *kernel_optval = ctx.optval; > > > + > > > + /* optlen == 0 from BPF indicates that we should > > > + * use original userspace data. > > > + */ > > > + if (ctx.optlen != 0) { > > > + *optlen = ctx.optlen; > > > I think it should be: > > *optlen = min(ctx.optlen, max_optlen); > We do have the following (existing) check above: > } else if (ctx.optlen > max_optlen || ctx.optlen < -1) { > /* optlen is out of bounds */ > ret = -EFAULT; > } else { > > So we shouldn't need any min here? Or am I missing something? ahh. you're right. Applied to bpf tree. > > Otherwise when bpf prog doesn't adjust ctx.oplen the kernel will see > > 4k only in kernel_optval whereas optlen will be > 4k. > > I suspect iptables sockopt should have crashed at this point. > > How did you test it? > The selftests that I've attached in the series. The test is passing > two pages and for IP_TOS we bypass the value via optlen=0 and > for IP_FREEBIND we trim the buffer to 1 byte. I think this should > cover this check here. > > One thing I didn't really test is getsockopt when the kernel > returns really large buffer (iptables). Right now, the test > gets 4 bytes (trimmed) from the kernel. I think that's the only > place that I didn't properly test. I wonder whether I should > do a real iptables-like setsockopt/getsockopt :-/ would be nice :)