On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote: > John, Martin, > > On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote: > > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote: > >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote: > >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote: > >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote: > >>> > [ ... ] > >>> > > >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk, > >>> >> sk->sk_prot = psock->sk_proto; > >>> >> psock->sk_proto = NULL; > >>> >> } > >>> >> + > >>> >> + if (psock->icsk_af_ops) { > >>> >> + icsk->icsk_af_ops = psock->icsk_af_ops; > >>> >> + psock->icsk_af_ops = NULL; > >>> >> + } > >>> >> } > >>> > > >>> > [ ... ] > >>> > > >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk, > >>> >> + struct sk_buff *skb, > >>> >> + struct request_sock *req, > >>> >> + struct dst_entry *dst, > >>> >> + struct request_sock *req_unhash, > >>> >> + bool *own_req) > >>> >> +{ > >>> >> + const struct inet_connection_sock_af_ops *ops; > >>> >> + void (*write_space)(struct sock *sk); > >>> >> + struct sk_psock *psock; > >>> >> + struct proto *proto; > >>> >> + struct sock *child; > >>> >> + > >>> >> + rcu_read_lock(); > >>> >> + psock = sk_psock(sk); > >>> >> + if (likely(psock)) { > >>> >> + proto = psock->sk_proto; > >>> >> + write_space = psock->saved_write_space; > >>> >> + ops = psock->icsk_af_ops; > >>> > It is not immediately clear to me what ensure > >>> > ops is not NULL here. > >>> > > >>> > It is likely I missed something. A short comment would > >>> > be very useful here. > >>> > >>> I can see the readability problem. Looking at it now, perhaps it should > >>> be rewritten, to the same effect, as: > >>> > >>> static struct sock *tcp_bpf_syn_recv_sock(...) > >>> { > >>> const struct inet_connection_sock_af_ops *ops = NULL; > >>> ... > >>> > >>> rcu_read_lock(); > >>> psock = sk_psock(sk); > >>> if (likely(psock)) { > >>> proto = psock->sk_proto; > >>> write_space = psock->saved_write_space; > >>> ops = psock->icsk_af_ops; > >>> } > >>> rcu_read_unlock(); > >>> > >>> if (!ops) > >>> ops = inet_csk(sk)->icsk_af_ops; > >>> child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); > >>> > >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it > >>> properly. To double check what happens here: > >> I did not mean the init path. The init path is fine since it init > >> eveything on psock before publishing the sk to the sock_map. > >> > >> I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear > >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets > >> psock->icsk_af_ops to NULL from running in parallel with > >> tcp_bpf_syn_recv_sock()? An explanation would be useful. > > > > Ah, I misunderstood. Nothing prevents the race, AFAIK. > > > > Setting psock->icsk_af_ops to null on restore and not checking for it > > here was a bad move on my side. Also I need to revisit what to do about > > psock->sk_proto so the child socket doesn't end up with null sk_proto. > > > > This race should be easy enough to trigger. Will give it a shot. > > I've convinced myself that this approach is racy beyond repair. > > Once syn_recv_sock() has returned it is too late to reset the child > sk_user_data and restore its callbacks. It has been already inserted > into ehash and ingress path can invoke its callbacks. > > The race can be triggered with with a reproducer where: > > thread-1: > > p = accept(s, ...); > close(p); > > thread-2: > > bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST); > bpf_map_delete_elem(mapfd, &key); > > This a dead-end because we can't have the parent and the child share the > psock state. Even though psock itself is refcounted, and potentially we > could grab a reference before cloning the parent, link into the map that > psock holds is not. > > Two ways out come to mind. Both involve touching TCP code, which I was > hoping to avoid: > > 1) reset sk_user_data when initializing the child > > This is problematic because tcp_bpf callbacks are not designed to > handle sockets with no psock _and_ with overridden sk_prot > callbacks. (Although, I think they could if the fallback was directly > on {tcp,tcpv6}_prot based on socket domain.) > > Also, there are other sk_user_data users like DRBD which rely on > sharing the sk_user_data pointer between parent and child, if I read > the code correctly [0]. If anything, clearing the sk_user_data on > clone would have to be guarded by a flag. Can the copy/not-to-copy sk_user_data decision be made in sk_clone_lock()? > > 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot > > The simpler way out. tcp_bpf callbacks never get invoked on the child > socket so the copied psock reference is no longer a problem. We can > clear the pointer on accept(). > > So far I wasn't able poke any holes in it and it comes down to > patching tcp_create_openreq_child() with: > > /* sk_msg and ULP frameworks can override the callbacks into > * protocol. We don't assume they are intended to be inherited > * by the child. Frameworks can re-install the callbacks on > * accept() if needed. > */ > WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator); > > That's what I'm going with for v2. > > Open to suggestions. > > Thanks, > Jakub > > BTW. Reading into kTLS code, I noticed it has been limited down to just > established sockets due to the same problem I'm struggling with here: > > static int tls_init(struct sock *sk) > { > ... > /* The TLS ulp is currently supported only for TCP sockets > * in ESTABLISHED state. > * Supporting sockets in LISTEN state will require us > * to modify the accept implementation to clone rather then > * share the ulp context. > */ > if (sk->sk_state != TCP_ESTABLISHED) > return -ENOTCONN; > > [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e=