On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote: > On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote: >> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote: >> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote: >> > [ ... ] >> > >> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk, >> >> sk->sk_prot = psock->sk_proto; >> >> psock->sk_proto = NULL; >> >> } >> >> + >> >> + if (psock->icsk_af_ops) { >> >> + icsk->icsk_af_ops = psock->icsk_af_ops; >> >> + psock->icsk_af_ops = NULL; >> >> + } >> >> } >> > >> > [ ... ] >> > >> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk, >> >> + struct sk_buff *skb, >> >> + struct request_sock *req, >> >> + struct dst_entry *dst, >> >> + struct request_sock *req_unhash, >> >> + bool *own_req) >> >> +{ >> >> + const struct inet_connection_sock_af_ops *ops; >> >> + void (*write_space)(struct sock *sk); >> >> + struct sk_psock *psock; >> >> + struct proto *proto; >> >> + struct sock *child; >> >> + >> >> + rcu_read_lock(); >> >> + psock = sk_psock(sk); >> >> + if (likely(psock)) { >> >> + proto = psock->sk_proto; >> >> + write_space = psock->saved_write_space; >> >> + ops = psock->icsk_af_ops; >> > It is not immediately clear to me what ensure >> > ops is not NULL here. >> > >> > It is likely I missed something. A short comment would >> > be very useful here. >> >> I can see the readability problem. Looking at it now, perhaps it should >> be rewritten, to the same effect, as: >> >> static struct sock *tcp_bpf_syn_recv_sock(...) >> { >> const struct inet_connection_sock_af_ops *ops = NULL; >> ... >> >> rcu_read_lock(); >> psock = sk_psock(sk); >> if (likely(psock)) { >> proto = psock->sk_proto; >> write_space = psock->saved_write_space; >> ops = psock->icsk_af_ops; >> } >> rcu_read_unlock(); >> >> if (!ops) >> ops = inet_csk(sk)->icsk_af_ops; >> child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); >> >> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it >> properly. To double check what happens here: > I did not mean the init path. The init path is fine since it init > eveything on psock before publishing the sk to the sock_map. > > I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear > to me what prevent the earlier pasted sk_psock_restore_proto() which sets > psock->icsk_af_ops to NULL from running in parallel with > tcp_bpf_syn_recv_sock()? An explanation would be useful. Ah, I misunderstood. Nothing prevents the race, AFAIK. Setting psock->icsk_af_ops to null on restore and not checking for it here was a bad move on my side. Also I need to revisit what to do about psock->sk_proto so the child socket doesn't end up with null sk_proto. This race should be easy enough to trigger. Will give it a shot. Thank you for bringing this up, Jakub > >> >> In sock_map_link we do a setup dance where we first create the psock and >> later initialize the socket callbacks (tcp_bpf_init). >> >> static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs, >> struct sock *sk) >> { >> ... >> if (psock) { >> ... >> } else { >> psock = sk_psock_init(sk, map->numa_node); >> if (!psock) { >> ret = -ENOMEM; >> goto out_progs; >> } >> sk_psock_is_new = true; >> } >> ... >> if (sk_psock_is_new) { >> ret = tcp_bpf_init(sk); >> if (ret < 0) >> goto out_drop; >> } else { >> tcp_bpf_reinit(sk); >> } >> >> The "if (sk_psock_new)" branch triggers the call chain that leads to >> saving & overriding socket callbacks. >> >> tcp_bpf_init -> tcp_bpf_update_sk_prot -> sk_psock_update_proto >> >> Among them, icsk_af_ops. >> >> static inline void sk_psock_update_proto(...) >> { >> ... >> psock->icsk_af_ops = icsk->icsk_af_ops; >> icsk->icsk_af_ops = af_ops; >> } >> >> Goes without saying that a comment is needed. >> >> Thanks for the feedback, >> Jakub