On Thu, Dec 12, 2019 at 08:23 PM CET, Martin Lau wrote: > On Thu, Dec 12, 2019 at 12:27:19PM +0100, Jakub Sitnicki wrote: >> On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote: >> > On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote: >> >> John, Martin, >> >> >> >> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote: >> >> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote: >> >> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote: >> >> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote: >> >> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote: >> >> >>> > [ ... ] >> >> >>> > >> >> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk, >> >> >>> >> sk->sk_prot = psock->sk_proto; >> >> >>> >> psock->sk_proto = NULL; >> >> >>> >> } >> >> >>> >> + >> >> >>> >> + if (psock->icsk_af_ops) { >> >> >>> >> + icsk->icsk_af_ops = psock->icsk_af_ops; >> >> >>> >> + psock->icsk_af_ops = NULL; >> >> >>> >> + } >> >> >>> >> } >> >> >>> > >> >> >>> > [ ... ] >> >> >>> > >> >> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk, >> >> >>> >> + struct sk_buff *skb, >> >> >>> >> + struct request_sock *req, >> >> >>> >> + struct dst_entry *dst, >> >> >>> >> + struct request_sock *req_unhash, >> >> >>> >> + bool *own_req) >> >> >>> >> +{ >> >> >>> >> + const struct inet_connection_sock_af_ops *ops; >> >> >>> >> + void (*write_space)(struct sock *sk); >> >> >>> >> + struct sk_psock *psock; >> >> >>> >> + struct proto *proto; >> >> >>> >> + struct sock *child; >> >> >>> >> + >> >> >>> >> + rcu_read_lock(); >> >> >>> >> + psock = sk_psock(sk); >> >> >>> >> + if (likely(psock)) { >> >> >>> >> + proto = psock->sk_proto; >> >> >>> >> + write_space = psock->saved_write_space; >> >> >>> >> + ops = psock->icsk_af_ops; >> >> >>> > It is not immediately clear to me what ensure >> >> >>> > ops is not NULL here. >> >> >>> > >> >> >>> > It is likely I missed something. A short comment would >> >> >>> > be very useful here. >> >> >>> >> >> >>> I can see the readability problem. Looking at it now, perhaps it should >> >> >>> be rewritten, to the same effect, as: >> >> >>> >> >> >>> static struct sock *tcp_bpf_syn_recv_sock(...) >> >> >>> { >> >> >>> const struct inet_connection_sock_af_ops *ops = NULL; >> >> >>> ... >> >> >>> >> >> >>> rcu_read_lock(); >> >> >>> psock = sk_psock(sk); >> >> >>> if (likely(psock)) { >> >> >>> proto = psock->sk_proto; >> >> >>> write_space = psock->saved_write_space; >> >> >>> ops = psock->icsk_af_ops; >> >> >>> } >> >> >>> rcu_read_unlock(); >> >> >>> >> >> >>> if (!ops) >> >> >>> ops = inet_csk(sk)->icsk_af_ops; >> >> >>> child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); >> >> >>> >> >> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it >> >> >>> properly. To double check what happens here: >> >> >> I did not mean the init path. The init path is fine since it init >> >> >> eveything on psock before publishing the sk to the sock_map. >> >> >> >> >> >> I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear >> >> >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets >> >> >> psock->icsk_af_ops to NULL from running in parallel with >> >> >> tcp_bpf_syn_recv_sock()? An explanation would be useful. >> >> > >> >> > Ah, I misunderstood. Nothing prevents the race, AFAIK. >> >> > >> >> > Setting psock->icsk_af_ops to null on restore and not checking for it >> >> > here was a bad move on my side. Also I need to revisit what to do about >> >> > psock->sk_proto so the child socket doesn't end up with null sk_proto. >> >> > >> >> > This race should be easy enough to trigger. Will give it a shot. >> >> >> >> I've convinced myself that this approach is racy beyond repair. >> >> >> >> Once syn_recv_sock() has returned it is too late to reset the child >> >> sk_user_data and restore its callbacks. It has been already inserted >> >> into ehash and ingress path can invoke its callbacks. >> >> >> >> The race can be triggered with with a reproducer where: >> >> >> >> thread-1: >> >> >> >> p = accept(s, ...); >> >> close(p); >> >> >> >> thread-2: >> >> >> >> bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST); >> >> bpf_map_delete_elem(mapfd, &key); >> >> >> >> This a dead-end because we can't have the parent and the child share the >> >> psock state. Even though psock itself is refcounted, and potentially we >> >> could grab a reference before cloning the parent, link into the map that >> >> psock holds is not. >> >> >> >> Two ways out come to mind. Both involve touching TCP code, which I was >> >> hoping to avoid: >> >> >> >> 1) reset sk_user_data when initializing the child >> >> >> >> This is problematic because tcp_bpf callbacks are not designed to >> >> handle sockets with no psock _and_ with overridden sk_prot >> >> callbacks. (Although, I think they could if the fallback was directly >> >> on {tcp,tcpv6}_prot based on socket domain.) >> >> >> >> Also, there are other sk_user_data users like DRBD which rely on >> >> sharing the sk_user_data pointer between parent and child, if I read >> >> the code correctly [0]. If anything, clearing the sk_user_data on >> >> clone would have to be guarded by a flag. >> > Can the copy/not-to-copy sk_user_data decision be made in >> > sk_clone_lock()? >> >> Yes, this could be pushed down to sk_clone_lock(), where we do similar >> work (reset sk_reuseport_cb and clone bpf_sk_storage): > aha. I missed your eariler "clearing the sk_user_data on clone would have > to be guarded by a flag..." part. It turns out we were talking the same > thing on (1). sock_flag works better if there is still bit left (and it > seems there is one), although I was thinking more like adding > something (e.g. a func ptr) to 'struct proto' to mangle sk_user_data > before returning newsk....but not sure this kind of logic > belongs to 'struct proto' Sorry for late reply. We have 4 bits left by my count. The multi-line comment for SOCK_NOFCS is getting in the way of counting them line-for-bit. A callback invoked on socket clone is something I was considering too. I'm not sure either where it belongs. At risk of being too use-case specific, perhaps it could live together with sk_user_data and sk_prot, which it would mangle on sk_clone_lock(): struct sock { ... void *sk_user_data; void (*sk_clone)(struct sock *sk, struct sock *newsk); ... } But, I feel adding a new sock field just for this wouldn't be justified. I can get by with a sock flag. Unless we have other uses for it? > >> >> /* User data can hold reference. Child must not >> * inherit the pointer without acquiring a reference. >> */ >> if (sock_flag(sk, SOCK_OWNS_USER_DATA)) { >> sock_reset_flag(newsk, SOCK_OWNS_USER_DATA); >> RCU_INIT_POINTER(newsk->sk_user_data, NULL); >> } >> >> I belive this would still need to be guarded by a flag. Do you see >> value in clearing child sk_user_data on clone as opposed to dealying >> that work until accept() time? > It seems to me clearing things up front at the very beginning is more > straight forward, such that it does not have to worry about the > sk_user_data may be used in a wrong way before it gets a chance > to be cleared in accept(). > > Just something to consider, if it is obvious that there is no hole in > clearing it in accept(), it is fine too. Just when I thought I could get away with lazily clearing the sk_user_data at accept() time, it occurred to me that it is not enough. Listening socket could get deleted from sockmap before a child socket that inherited a copy of sk_user_data pointer gets accept()'ed. In such scenario the pointer would not get NULL'ed on accept(), because listening socket would have it's sk_prot->accept restored by then. I will need that flag after all... -jkbs > >> >> >> >> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot >> >> >> >> The simpler way out. tcp_bpf callbacks never get invoked on the child >> >> socket so the copied psock reference is no longer a problem. We can >> >> clear the pointer on accept(). >> >> >> >> So far I wasn't able poke any holes in it and it comes down to >> >> patching tcp_create_openreq_child() with: >> >> >> >> /* sk_msg and ULP frameworks can override the callbacks into >> >> * protocol. We don't assume they are intended to be inherited >> >> * by the child. Frameworks can re-install the callbacks on >> >> * accept() if needed. >> >> */ >> >> WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator); >> >> >> >> That's what I'm going with for v2. >> >> >> >> Open to suggestions. >> >> >> >> Thanks, >> >> Jakub >> >> >> >> BTW. Reading into kTLS code, I noticed it has been limited down to just >> >> established sockets due to the same problem I'm struggling with here: >> >> >> >> static int tls_init(struct sock *sk) >> >> { >> >> ... >> >> /* The TLS ulp is currently supported only for TCP sockets >> >> * in ESTABLISHED state. >> >> * Supporting sockets in LISTEN state will require us >> >> * to modify the accept implementation to clone rather then >> >> * share the ulp context. >> >> */ >> >> if (sk->sk_state != TCP_ESTABLISHED) >> >> return -ENOTCONN; >> >> >> >> [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e=