Martin Lau wrote: > On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote: > > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote: > > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote: > > > [ ... ] > > > > > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk, > > >> sk->sk_prot = psock->sk_proto; > > >> psock->sk_proto = NULL; > > >> } > > >> + > > >> + if (psock->icsk_af_ops) { > > >> + icsk->icsk_af_ops = psock->icsk_af_ops; > > >> + psock->icsk_af_ops = NULL; > > >> + } > > >> } > > > > > > [ ... ] > > > > > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk, > > >> + struct sk_buff *skb, > > >> + struct request_sock *req, > > >> + struct dst_entry *dst, > > >> + struct request_sock *req_unhash, > > >> + bool *own_req) > > >> +{ > > >> + const struct inet_connection_sock_af_ops *ops; > > >> + void (*write_space)(struct sock *sk); > > >> + struct sk_psock *psock; > > >> + struct proto *proto; > > >> + struct sock *child; > > >> + > > >> + rcu_read_lock(); > > >> + psock = sk_psock(sk); > > >> + if (likely(psock)) { > > >> + proto = psock->sk_proto; > > >> + write_space = psock->saved_write_space; > > >> + ops = psock->icsk_af_ops; > > > It is not immediately clear to me what ensure > > > ops is not NULL here. > > > > > > It is likely I missed something. A short comment would > > > be very useful here. > > > > I can see the readability problem. Looking at it now, perhaps it should > > be rewritten, to the same effect, as: > > > > static struct sock *tcp_bpf_syn_recv_sock(...) > > { > > const struct inet_connection_sock_af_ops *ops = NULL; > > ... > > > > rcu_read_lock(); > > psock = sk_psock(sk); > > if (likely(psock)) { > > proto = psock->sk_proto; > > write_space = psock->saved_write_space; > > ops = psock->icsk_af_ops; > > } > > rcu_read_unlock(); > > > > if (!ops) > > ops = inet_csk(sk)->icsk_af_ops; > > child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); > > > > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it > > properly. To double check what happens here: > I did not mean the init path. The init path is fine since it init > eveything on psock before publishing the sk to the sock_map. > > I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear > to me what prevent the earlier pasted sk_psock_restore_proto() which sets > psock->icsk_af_ops to NULL from running in parallel with > tcp_bpf_syn_recv_sock()? An explanation would be useful. > I'll answer. Updates are protected via sk_callback_lock so we don't have parrallel updates in-flight causing write_space and sk_proto to be out of sync. However access should be OK because its a pointer write we never update the pointer in place, e.g. static inline void sk_psock_restore_proto(struct sock *sk, struct sk_psock *psock) { + struct inet_connection_sock *icsk = inet_csk(sk); + sk->sk_write_space = psock->saved_write_space; if (psock->sk_proto) { struct inet_connection_sock *icsk = inet_csk(sk); bool has_ulp = !!icsk->icsk_ulp_data; if (has_ulp) tcp_update_ulp(sk, psock->sk_proto); else sk->sk_prot = psock->sk_proto; psock->sk_proto = NULL; } + + if (psock->icsk_af_ops) { + icsk->icsk_af_ops = psock->icsk_af_ops; + psock->icsk_af_ops = NULL; + } } In restore case either psock->icsk_af_ops is null or not. If its null below code catches it. If its not null (from init path) then we have a valid pointer. rcu_read_lock(); psock = sk_psock(sk); if (likely(psock)) { proto = psock->sk_proto; write_space = psock->saved_write_space; ops = psock->icsk_af_ops; } rcu_read_unlock(); if (!ops) ops = inet_csk(sk)->icsk_af_ops; child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); We should do this with proper READ_ONCE/WRITE_ONCE to make it clear what is going on and to stop compiler from breaking these assumptions. I was going to generate that patch after this series but can do it before as well. I didn't mention it here because it seems a bit out of scope for this series because its mostly a fix to older code. Also I started to think that write_space might be out of sync with ops but it seems we never actually remove psock_write_space until after rcu grace period so that should be OK as well and always point to the previous write_space. Finally I wondered if we could remove the ops and then add it back quickly which seems at least in theory possible, but that would get hit with a grace period because we can't have conflicting psock definitions on the same sock. So expanding the rcu block to include the ops = inet_csk(sk)->icsk_af_ops would fix that case. So in summary I think we should expand the rcu lock here to include the ops = inet_csk(sk)->icsk_af_ops to ensure we dont race with tear down and create. I'll push the necessary update with WRITE_ONCE and READ_ONCE to fix that up. Seeing we have to wait until the merge window opens most likely anyways I'll send those out sooner rather then later and this series can add the proper annotations as well.