Re: [PATCH bpf-next v4 1/8] bpf: implement getsockopt and setsockopt hooks

Martin Lau <kafai@xxxxxx> · Mon, 10 Jun 2019 19:55:44 +0000



On Mon, Jun 10, 2019 at 12:35:58PM -0700, Stanislav Fomichev wrote:
> On 06/10, Martin Lau wrote:
> > On Mon, Jun 10, 2019 at 09:34:14AM -0700, Stanislav Fomichev wrote:
> > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > 
> > > BPF_CGROUP_SETSOCKOPT get a read-only view of the setsockopt arguments.
> > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > 
> > > The buffer memory is pre-allocated (because I don't think there is
> > > a precedent for working with __user memory from bpf). This might be
> > > slow to do for each {s,g}etsockopt call, that's why I've added
> > > __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
> > > attached to a cgroup. Note, however, that there is a race between
> > > __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
> > > program layout might have changed; this should not be a problem
> > > because in general there is a race between multiple calls to
> > > {s,g}etsocktop and user adding/removing bpf progs from a cgroup.
> > > 
> > > The return code of the BPF program is handled as follows:
> > > * 0: EPERM
> > > * 1: success, execute kernel {s,g}etsockopt path after BPF prog exits
> > > * 2: success, do _not_ execute kernel {s,g}etsockopt path after BPF
> > >      prog exits
> > > 
> > > v4:
> > > * don't export bpf_sk_fullsock helper (Martin Lau)
> > > * size != sizeof(__u64) for uapi pointers (Martin Lau)
> > > * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
> > > 
> > > v3:
> > > * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
> > > * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
> > >   Nakryiko)
> > > * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
> > > * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
> > > * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
> > > 
> > > v2:
> > > * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
> > > * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
> > > * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
> > > * added [0,2] return code check to verifier (Martin Lau)
> > > * dropped unused buf[64] from the stack (Martin Lau)
> > > * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
> > > * dropped bpf_target_off from ctx rewrites (Martin Lau)
> > > * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
> > > 
> > > Cc: Martin Lau <kafai@xxxxxx>
> > > Acked-by: Andrii Nakryiko <andriin@xxxxxx>
> > > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> > > ---
> > >  include/linux/bpf-cgroup.h |  29 ++++
> > >  include/linux/bpf.h        |  45 +++++++
> > >  include/linux/bpf_types.h  |   1 +
> > >  include/linux/filter.h     |  13 ++
> > >  include/uapi/linux/bpf.h   |  13 ++
> > >  kernel/bpf/cgroup.c        | 262 +++++++++++++++++++++++++++++++++++++
> > >  kernel/bpf/core.c          |   9 ++
> > >  kernel/bpf/syscall.c       |  19 +++
> > >  kernel/bpf/verifier.c      |  15 +++
> > >  net/core/filter.c          |   2 +-
> > >  net/socket.c               |  18 +++
> > >  11 files changed, 425 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > index b631ee75762d..406f1ba82531 100644
> > > --- a/include/linux/bpf-cgroup.h
> > > +++ b/include/linux/bpf-cgroup.h
> > > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > >  				   loff_t *ppos, void **new_buf,
> > >  				   enum bpf_attach_type type);
> > >  
> > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > > +				       int optname, char __user *optval,
> > > +				       unsigned int optlen);
> > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > > +				       int optname, char __user *optval,
> > > +				       int __user *optlen);
> > > +
> > >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> > >  	struct bpf_map *map)
> > >  {
> > > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> > >  	__ret;								       \
> > >  })
> > >  
> > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > +({									       \
> > > +	int __ret = 0;							       \
> > > +	if (cgroup_bpf_enabled)						       \
> > > +		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
> > > +							   optname, optval,    \
> > > +							   optlen);	       \
> > > +	__ret;								       \
> > > +})
> > > +
> > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > +({									       \
> > > +	int __ret = 0;							       \
> > > +	if (cgroup_bpf_enabled)						       \
> > > +		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
> > > +							   optname, optval,    \
> > > +							   optlen);	       \
> > > +	__ret;								       \
> > > +})
> > > +
> > >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> > >  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
> > >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> > >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> > >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> > >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > >  
> > >  #define for_each_cgroup_storage_type(stype) for (; false; )
> > >  
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index e5a309e6a400..194a47ca622f 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -520,6 +520,7 @@ struct bpf_prog_array {
> > >  struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
> > >  void bpf_prog_array_free(struct bpf_prog_array *progs);
> > >  int bpf_prog_array_length(struct bpf_prog_array *progs);
> > > +bool bpf_prog_array_is_empty(struct bpf_prog_array *array);
> > >  int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
> > >  				__u32 __user *prog_ids, u32 cnt);
> > >  
> > > @@ -606,6 +607,49 @@ _out:							\
> > >  		_ret;					\
> > >  	})
> > >  
> > > +/* To be used by BPF_PROG_TYPE_CGROUP_SOCKOPT program type.
> > > + *
> > > + * Expected BPF program return values are:
> > > + *   0: return -EPERM to the userspace
> > > + *   1: sockopt was not handled by BPF, kernel should do it
> > > + *   2: sockopt was handled by BPF, kernel should _not_ do it and return
> > > + *      to the userspace instead
> > > + *
> > > + * Note, that return '0' takes precedence over everything else. In other
> > > + * words, if any single program in the prog array has returned 0,
> > > + * the userspace will get -EPERM (regardless of what other programs
> > > + * return).
> > > + *
> > > + * The macro itself returns:
> > > + *        0: sockopt was not handled by BPF, kernel should do it
> > > + *        1: sockopt was handled by BPF, kernel should _not_ do it
> > > + *   -EPERM: return error back to userspace
> > > + */
> > > +#define BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY(array, ctx, func)		\
> > > +	({								\
> > > +		struct bpf_prog_array_item *_item;			\
> > > +		struct bpf_prog_array *_array;				\
> > > +		struct bpf_prog *_prog;					\
> > > +		u32 _success = 1;					\
> > > +		u32 _bypass = 0;					\
> > > +		u32 ret;						\
> > > +		preempt_disable();					\
> > > +		rcu_read_lock();					\
> > > +		_array = rcu_dereference(array);			\
> > > +		_item = &_array->items[0];				\
> > > +		while ((_prog = READ_ONCE(_item->prog))) {		\
> > > +			bpf_cgroup_storage_set(_item->cgroup_storage);	\
> > > +			ret = func(_prog, ctx);				\
> > > +			_success &= (ret > 0);				\
> > > +			_bypass |= (ret == 2);				\
> > > +			_item++;					\
> > > +		}							\
> > > +		rcu_read_unlock();					\
> > > +		preempt_enable();					\
> > > +		ret = _success ? _bypass : -EPERM;			\
> > > +		ret;							\
> > > +	})
> > > +
> > >  #define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
> > >  	__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
> > >  
> > > @@ -1054,6 +1098,7 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> > >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > >  extern const struct bpf_func_proto bpf_strtol_proto;
> > >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > >  
> > >  /* Shared helpers among cBPF and eBPF. */
> > >  void bpf_user_rnd_init_once(void);
> > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > index 5a9975678d6f..eec5aeeeaf92 100644
> > > --- a/include/linux/bpf_types.h
> > > +++ b/include/linux/bpf_types.h
> > > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> > >  #ifdef CONFIG_CGROUP_BPF
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> > >  #endif
> > >  #ifdef CONFIG_BPF_LIRC_MODE2
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 43b45d6db36d..6e64d01e4e36 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -1199,4 +1199,17 @@ struct bpf_sysctl_kern {
> > >  	u64 tmp_reg;
> > >  };
> > >  
> > > +struct bpf_sockopt_kern {
> > > +	struct sock	*sk;
> > > +	u8		*optval;
> > > +	u8		*optval_end;
> > > +	s32		level;
> > > +	s32		optname;
> > > +	u32		optlen;
> > > +
> > > +	/* Small on-stack optval buffer to avoid small allocations.
> > > +	 */
> > > +	u8 buf[64] __aligned(8);
> > > +};
> > > +
> > >  #endif /* __LINUX_FILTER_H__ */
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 7c6aef253173..afaa7e28d1e4 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> > >  	BPF_PROG_TYPE_FLOW_DISSECTOR,
> > >  	BPF_PROG_TYPE_CGROUP_SYSCTL,
> > >  	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > > +	BPF_PROG_TYPE_CGROUP_SOCKOPT,
> > >  };
> > >  
> > >  enum bpf_attach_type {
> > > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> > >  	BPF_LIRC_MODE2,
> > >  	BPF_FLOW_DISSECTOR,
> > >  	BPF_CGROUP_SYSCTL,
> > > +	BPF_CGROUP_GETSOCKOPT,
> > > +	BPF_CGROUP_SETSOCKOPT,
> > >  	__MAX_BPF_ATTACH_TYPE
> > >  };
> > >  
> > > @@ -3533,4 +3536,14 @@ struct bpf_sysctl {
> > >  				 */
> > >  };
> > >  
> > > +struct bpf_sockopt {
> > > +	__bpf_md_ptr(struct bpf_sock *, sk);
> > > +	__bpf_md_ptr(void *, optval);
> > > +	__bpf_md_ptr(void *, optval_end);
> > > +
> > > +	__s32	level;
> > > +	__s32	optname;
> > > +	__u32	optlen;
> > > +};
> > > +
> > >  #endif /* _UAPI__LINUX_BPF_H__ */
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index 1b65ab0df457..dcc06edaad7b 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/bpf.h>
> > >  #include <linux/bpf-cgroup.h>
> > >  #include <net/sock.h>
> > > +#include <net/bpf_sk_storage.h>
> > >  
> > >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> > >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > >  }
> > >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> > >  
> > > +static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
> > > +					     enum bpf_attach_type attach_type)
> > > +{
> > > +	struct bpf_prog_array *prog_array;
> > > +	bool empty;
> > > +
> > > +	rcu_read_lock();
> > > +	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > > +	empty = bpf_prog_array_is_empty(prog_array);
> > > +	rcu_read_unlock();
> > > +
> > > +	return empty;
> > > +}
> > > +
> > > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > > +{
> > > +	if (unlikely(max_optlen > PAGE_SIZE))
> > > +		return -EINVAL;
> > > +
> > > +	if (likely(max_optlen <= sizeof(ctx->buf))) {
> > > +		ctx->optval = ctx->buf;
> > > +	} else {
> > > +		ctx->optval = kzalloc(max_optlen, GFP_USER);
> > > +		if (!ctx->optval)
> > > +			return -ENOMEM;
> > > +	}
> > > +
> > > +	ctx->optval_end = ctx->optval + max_optlen;
> > > +	ctx->optlen = max_optlen;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > > +{
> > > +	if (unlikely(ctx->optval != ctx->buf))
> > > +		kfree(ctx->optval);
> > > +}
> > > +
> > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > > +				       int optname, char __user *optval,
> > > +				       unsigned int optlen)
> > > +{
> > > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > +	struct bpf_sockopt_kern ctx = {
> > > +		.sk = sk,
> > > +		.level = level,
> > > +		.optname = optname,
> > > +	};
> > > +	int ret;
> > > +
> > > +	/* Opportunistic check to see whether we have any BPF program
> > > +	 * attached to the hook so we don't waste time allocating
> > > +	 * memory and locking the socket.
> > > +	 */
> > > +	if (__cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_SETSOCKOPT))
> > > +		return 0;
> > > +
> > > +	ret = sockopt_alloc_buf(&ctx, optlen);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > > +		sockopt_free_buf(&ctx);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	lock_sock(sk);
> > > +	ret = BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY(
> > > +		cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > > +		&ctx, BPF_PROG_RUN);
> > > +	release_sock(sk);
> > > +
> > > +	sockopt_free_buf(&ctx);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > > +
> > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > > +				       int optname, char __user *optval,
> > > +				       int __user *optlen)
> > > +{
> > > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > +	struct bpf_sockopt_kern ctx = {
> > > +		.sk = sk,
> > > +		.level = level,
> > > +		.optname = optname,
> > > +	};
> > > +	int max_optlen;
> > > +	int ret;
> > > +
> > > +	/* Opportunistic check to see whether we have any BPF program
> > > +	 * attached to the hook so we don't waste time allocating
> > > +	 * memory and locking the socket.
> > > +	 */
> > > +	if (__cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_GETSOCKOPT))
> > > +		return 0;
> > > +
> > > +	if (get_user(max_optlen, optlen))
> > > +		return -EFAULT;
> > > +
> > > +	ret = sockopt_alloc_buf(&ctx, max_optlen);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	lock_sock(sk);
> > > +	ret = BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY(
> > > +		cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > > +		&ctx, BPF_PROG_RUN);
> > > +	release_sock(sk);
> > > +
> > > +	if (ret < 0) {
> > > +		sockopt_free_buf(&ctx);
> > > +		return ret;
> > > +	}
> > > +
> > > +	if (ctx.optlen > max_optlen) {
> > v4 looks very good.
> > 
> > One minor question here, what may be the use case if the bpf_prog makes
> > ctx.optlen < max_optlen(==user's input) but sets ret == 0 (i.e. kernel should
> > continue to handle it)?
> Do you mean bpf fills in optval + adjusts optlen, but returns 0?
> I don't see any usecase for that, bpf's output would be ignored in that case.
Yes,

> 
> Do you think we should have an additional check to make sure
> ctx.optval data is zero and ctx.optlen is not changed in that case?
or skip the copy_to_user() and put_user() when ret == 0?

> 
> > > +		sockopt_free_buf(&ctx);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > > +		sockopt_free_buf(&ctx);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	sockopt_free_buf(&ctx);
> > > +
> > > +	if (put_user(ctx.optlen, optlen))
> > > +		return -EFAULT;
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > > +