Re: [PATCH bpf v5 1/3] bpf: don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE

sdf@xxxxxxxxxx · Wed, 17 Jun 2020 10:45:08 -0700

On 06/17, Alexei Starovoitov wrote:
On Tue, Jun 16, 2020 at 06:04:14PM -0700, Stanislav Fomichev wrote:
> Attaching to these hooks can break iptables because its optval is
> usually quite big, or at least bigger than the current PAGE_SIZE limit.
> David also mentioned some SCTP options can be big (around 256k).
>
> For such optvals we expose only the first PAGE_SIZE bytes to
> the BPF program. BPF program has two options:
> 1. Set ctx->optlen to 0 to indicate that the BPF's optval
>    should be ignored and the kernel should use original userspace
>    value.
> 2. Set ctx->optlen to something that's smaller than the PAGE_SIZE.
>
> v5:
> * use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov)
> * update the docs accordingly
>
> v4:
> * use temporary buffer to avoid optval == optval_end == NULL;
>   this removes the corner case in the verifier that might assume
>   non-zero PTR_TO_PACKET/PTR_TO_PACKET_END.
>
> v3:
> * don't increase the limit, bypass the argument
>
> v2:
> * proper comments formatting (Jakub Kicinski)
>
> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
> Cc: David Laight <David.Laight@xxxxxxxxxx>
> Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> ---
>  kernel/bpf/cgroup.c | 53 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 33 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 4d76f16524cc..ac53102e244a 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1276,16 +1276,23 @@ static bool  
__cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
>
>  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int  
max_optlen)
>  {
> -	if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
> +	if (unlikely(max_optlen < 0))
>  		return -EINVAL;
>
> +	if (unlikely(max_optlen > PAGE_SIZE)) {
> +		/* We don't expose optvals that are greater than PAGE_SIZE
> +		 * to the BPF program.
> +		 */
> +		max_optlen = PAGE_SIZE;
> +	}
> +
>  	ctx->optval = kzalloc(max_optlen, GFP_USER);
>  	if (!ctx->optval)
>  		return -ENOMEM;
>
>  	ctx->optval_end = ctx->optval + max_optlen;
>
> -	return 0;
> +	return max_optlen;
>  }
>
>  static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> @@ -1319,13 +1326,13 @@ int __cgroup_bpf_run_filter_setsockopt(struct  
sock *sk, int *level,
>  	 */
>  	max_optlen = max_t(int, 16, *optlen);
>
> -	ret = sockopt_alloc_buf(&ctx, max_optlen);
> -	if (ret)
> -		return ret;
> +	max_optlen = sockopt_alloc_buf(&ctx, max_optlen);
> +	if (max_optlen < 0)
> +		return max_optlen;
>
>  	ctx.optlen = *optlen;
>
> -	if (copy_from_user(ctx.optval, optval, *optlen) != 0) {
> +	if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) !=  
0) {
>  		ret = -EFAULT;
>  		goto out;
>  	}
> @@ -1353,8 +1360,14 @@ int __cgroup_bpf_run_filter_setsockopt(struct  
sock *sk, int *level,
>  		/* export any potential modifications */
>  		*level = ctx.level;
>  		*optname = ctx.optname;
> -		*optlen = ctx.optlen;
> -		*kernel_optval = ctx.optval;
> +
> +		/* optlen == 0 from BPF indicates that we should
> +		 * use original userspace data.
> +		 */
> +		if (ctx.optlen != 0) {
> +			*optlen = ctx.optlen;

I think it should be:
*optlen = min(ctx.optlen, max_optlen);
We do have the following (existing) check above:
	} else if (ctx.optlen > max_optlen || ctx.optlen < -1) {
		/* optlen is out of bounds */
		ret = -EFAULT;
	} else {

So we shouldn't need any min here? Or am I missing something?

Otherwise when bpf prog doesn't adjust ctx.oplen the kernel will see
4k only in kernel_optval whereas optlen will be > 4k.
I suspect iptables sockopt should have crashed at this point.
How did you test it?
The selftests that I've attached in the series. The test is passing
two pages and for IP_TOS we bypass the value via optlen=0 and
for IP_FREEBIND we trim the buffer to 1 byte. I think this should
cover this check here.

One thing I didn't really test is getsockopt when the kernel
returns really large buffer (iptables). Right now, the test
gets 4 bytes (trimmed) from the kernel. I think that's the only
place that I didn't properly test. I wonder whether I should
do a real iptables-like setsockopt/getsockopt :-/