> On Feb 27, 2023, at 6:56 AM, Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx> wrote:
>
>
>
>> On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>>
>> On 02/23, Aditi Ghag wrote:
>>> The socket destroy kfunc is used to forcefully terminate sockets from
>>> certain BPF contexts. We plan to use the capability in Cilium to force
>>> client sockets to reconnect when their remote load-balancing backends
>>> are deleted. The other use case is on-the-fly policy enforcement, where
>>> existing socket connections prevented by policies need to be forcefully
>>> terminated. The helper allows terminating sockets that may or may not
>>> be actively sending traffic.
>>
>>> The helper is currently exposed to certain BPF iterators where users
>>> can filter and terminate selected sockets. Additionally, the helper can
>>> only be called from BPF contexts that ensure socket locking, in order
>>> to allow synchronous execution of destroy helpers that also acquire
>>> socket locks. The previous commit, which batches UDP sockets during
>>> iteration, facilitated synchronous invocation of the destroy helper
>>> from BPF context by skipping the socket locks in the destroy handler.
>>> TCP iterators already supported batching.
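For illustration, a tcp iterator program invoking the kfunc could look
roughly like the sketch below. This is not part of the patch; the __ksym
declaration style follows the BPF selftests, and the program name and
port filter are arbitrary examples.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Kfunc added by this patch, resolved against vmlinux BTF. */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int iter_tcp_destroy(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common)
		return 0;

	/* Destroy only sockets matching a user-chosen filter, here an
	 * arbitrary local port (skc_num is in host byte order).
	 */
	if (sk_common->skc_num == 8080)
		bpf_sock_destroy(sk_common);

	return 0;
}

char _license[] SEC("license") = "GPL";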
>>
>>> The helper takes a `sock_common` type argument, even though it
>>> expects, and casts it to, a `sock` pointer. This enables the verifier
>>> to allow the sock_destroy kfunc to be called for TCP with `sock_common`
>>> and for UDP with `sock` structs. As a comparison, BPF helpers enable
>>> this behavior with the `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type.
>>> However, there is no such option available in the verifier logic that
>>> handles kfuncs, where BTF types are inferred. Furthermore, as
>>> `sock_common` only has a subset of the fields of `sock`, casting a
>>> pointer to the latter type might not always be safe. Hence, the BPF
>>> kfunc converts the argument to a full sock before casting.
>>
>>> Signed-off-by: Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx>
>>> ---
>>> net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
>>> net/ipv4/tcp.c    | 17 ++++++++++-----
>>> net/ipv4/udp.c    |  7 ++++--
>>> 3 files changed, 72 insertions(+), 7 deletions(-)
>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 1d6f165923bf..79cd91ba13d0 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>>
>>> 	return func;
>>> }
>>> +
>>> +/* Disables missing prototype warnings */
>>> +__diag_push();
>>> +__diag_ignore_all("-Wmissing-prototypes",
>>> +		  "Global functions as their definitions will be in vmlinux BTF");
>>> +
>>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>>> + *
>>> + * The helper expects a non-NULL pointer to a full socket. It invokes
>>> + * the protocol specific socket destroy handlers.
>>> + *
>>> + * The helper can only be called from BPF contexts that have acquired
>>> + * the socket locks.
>>> + *
>>> + * Parameters:
>>> + * @sock: Pointer to socket to be destroyed
>>> + *
>>> + * Return:
>>> + * On error, may return EOPNOTSUPP or EINVAL.
>>> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
>>> + * 0 otherwise
>>> + */
>>> +int bpf_sock_destroy(struct sock_common *sock)
>>
>> Prefix with __bpf_kfunc (see other kfuncs).
>
> Will do!
>
>>
>>> +{
>>> +	/* Validates that the socket can be cast to a full socket. */
>>> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
>>> +
>>> +	if (!sk)
>>> +		return -EINVAL;
>>> +
>>> +	/* The locking semantics that allow for synchronous execution of
>>> +	 * the destroy handlers are only supported for TCP and UDP.
>>> +	 */
>>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>>> +}
>>> +
>>> +__diag_pop()
>>> +
>>> +BTF_SET8_START(sock_destroy_kfunc_set)
>>> +BTF_ID_FLAGS(func, bpf_sock_destroy)
>>> +BTF_SET8_END(sock_destroy_kfunc_set)
>>> +
>>> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
>>> +	.owner = THIS_MODULE,
>>> +	.set = &sock_destroy_kfunc_set,
>>> +};
>>> +
>>> +static int init_subsystem(void)
>>> +{
>>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
>>
>> Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
>> program from anywhere? What if the socket is not locked?
>> IOW, do we have to constrain it to the iterator programs (at least for
>> now)?
>
> Given that kprobes are not considered part of BPF_PROG_TYPE_TRACING, I'm
> not sure there are other tracing programs with sock/sock_common
> arguments. Regardless, this is a valid point. I had brought up a similar
> topic earlier during the v1 discussion -
> https://lore.kernel.org/bpf/78E434B0-06A9-466F-A061-B9A05DC6DE6D@xxxxxxxxxxxxx/.
> I suppose you would have a similar problem in the case of the
> setsockopt* helpers.
> Is the general topic of limiting kfunc access to a subset of BPF_PROG_*
> program types being discussed? I've submitted this as a potential topic
> for the LSF/MM/BPF agenda. Also, related to this discussion: could we
> have something similar to lockdep_sock_is_held with less overhead?
>
>>
>>> +}
>>> +late_initcall(init_subsystem);
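As an aside, the userspace flow that drives such an iterator might look
like the hypothetical sketch below: the BPF program (and hence
bpf_sock_destroy) only runs when the iterator fd is read.
run_destroy_iter and the buffer size are made up for illustration, and
error handling is minimal.

#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Attach an "iter/tcp" program and drive it to completion; each read()
 * walks a batch of sockets through the program, which is where the
 * destroy kfunc gets invoked on matching sockets.
 */
static int run_destroy_iter(struct bpf_program *prog)
{
	struct bpf_link *link;
	char buf[64];
	int iter_fd, err = 0;

	link = bpf_program__attach_iter(prog, NULL);
	if (!link)
		return -1;

	iter_fd = bpf_iter_create(bpf_link__fd(link));
	if (iter_fd < 0) {
		err = -1;
		goto out;
	}

	while (read(iter_fd, buf, sizeof(buf)) > 0)
		;

	close(iter_fd);
out:
	bpf_link__destroy(link);
	return err;
}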
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 33f559f491c8..8123c264d8ea 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>>> 		return 0;
>>> 	}
>>
>>> -	/* Don't race with userspace socket closes such as tcp_close. */
>>> -	lock_sock(sk);
>>> +	/* BPF context ensures sock locking. */
>>> +	if (!has_current_bpf_ctx())
>>> +		/* Don't race with userspace socket closes such as tcp_close. */
>>> +		lock_sock(sk);
>>
>>> 	if (sk->sk_state == TCP_LISTEN) {
>>> 		tcp_set_state(sk, TCP_CLOSE);
>>> @@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
>>
>>> 	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>>> 	local_bh_disable();
>>> -	bh_lock_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		bh_lock_sock(sk);
>>> +
>>
>>> 	if (!sock_flag(sk, SOCK_DEAD)) {
>>> 		sk->sk_err = err;
>>> @@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
>>> 		tcp_done(sk);
>>> 	}
>>
>>> -	bh_unlock_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		bh_unlock_sock(sk);
>>> +
>>> 	local_bh_enable();
>>> 	tcp_write_queue_purge(sk);
>>> -	release_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		release_sock(sk);
>>> 	return 0;
>>> }
>>> EXPORT_SYMBOL_GPL(tcp_abort);
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 2f3978de45f2..1bc9ad92c3d4 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
>>
>>> int udp_abort(struct sock *sk, int err)
>>> {
>>> -	lock_sock(sk);
>>> +	/* BPF context ensures sock locking. */
>>> +	if (!has_current_bpf_ctx())
>>> +		lock_sock(sk);
>>
>>> 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>>> 	 * with close()
>>> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>>> 		__udp_disconnect(sk, 0);
>>
>>> out:
>>> -	release_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		release_sock(sk);
>>
>>> 	return 0;
>>> }
>>> --
>>> 2.34.1
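Bottom note on the locking question above: the has_current_bpf_ctx()
guards in tcp_abort()/udp_abort() key off whether the current task is
executing a BPF program. If I am reading include/linux/bpf.h right, the
helper is roughly the following, so the guarded paths trust that the BPF
iterator context already holds the socket lock:

/* Simplified sketch of the existing helper in include/linux/bpf.h. */
static inline bool has_current_bpf_ctx(void)
{
	return !!current->bpf_ctx;
}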