On Tue, Nov 26, 2019 at 07:43 PM CET, John Fastabend wrote: > Martin Lau wrote: >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote: >> > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote: >> > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote: >> > > [ ... ] >> > > >> > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk, >> > >> sk->sk_prot = psock->sk_proto; >> > >> psock->sk_proto = NULL; >> > >> } >> > >> + >> > >> + if (psock->icsk_af_ops) { >> > >> + icsk->icsk_af_ops = psock->icsk_af_ops; >> > >> + psock->icsk_af_ops = NULL; >> > >> + } >> > >> } >> > > >> > > [ ... ] >> > > >> > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk, >> > >> + struct sk_buff *skb, >> > >> + struct request_sock *req, >> > >> + struct dst_entry *dst, >> > >> + struct request_sock *req_unhash, >> > >> + bool *own_req) >> > >> +{ >> > >> + const struct inet_connection_sock_af_ops *ops; >> > >> + void (*write_space)(struct sock *sk); >> > >> + struct sk_psock *psock; >> > >> + struct proto *proto; >> > >> + struct sock *child; >> > >> + >> > >> + rcu_read_lock(); >> > >> + psock = sk_psock(sk); >> > >> + if (likely(psock)) { >> > >> + proto = psock->sk_proto; >> > >> + write_space = psock->saved_write_space; >> > >> + ops = psock->icsk_af_ops; >> > > It is not immediately clear to me what ensure >> > > ops is not NULL here. >> > > >> > > It is likely I missed something. A short comment would >> > > be very useful here. >> > >> > I can see the readability problem. Looking at it now, perhaps it should >> > be rewritten, to the same effect, as: >> > >> > static struct sock *tcp_bpf_syn_recv_sock(...) >> > { >> > const struct inet_connection_sock_af_ops *ops = NULL; >> > ... >> > >> > rcu_read_lock(); >> > psock = sk_psock(sk); >> > if (likely(psock)) { >> > proto = psock->sk_proto; >> > write_space = psock->saved_write_space; >> > ops = psock->icsk_af_ops; >> > } >> > rcu_read_unlock(); >> > >> > if (!ops) >> > ops = inet_csk(sk)->icsk_af_ops; >> > child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); >> > >> > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it >> > properly. To double check what happens here: >> I did not mean the init path. The init path is fine since it init >> eveything on psock before publishing the sk to the sock_map. >> >> I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets >> psock->icsk_af_ops to NULL from running in parallel with >> tcp_bpf_syn_recv_sock()? An explanation would be useful. >> > > I'll answer. Updates are protected via sk_callback_lock so we don't have > parrallel updates in-flight causing write_space and sk_proto to be out > of sync. However access should be OK because its a pointer write we > never update the pointer in place, e.g. > > static inline void sk_psock_restore_proto(struct sock *sk, > struct sk_psock *psock) > { > + struct inet_connection_sock *icsk = inet_csk(sk); > + > sk->sk_write_space = psock->saved_write_space; > > if (psock->sk_proto) { > struct inet_connection_sock *icsk = inet_csk(sk); > bool has_ulp = !!icsk->icsk_ulp_data; > > if (has_ulp) > tcp_update_ulp(sk, psock->sk_proto); > else > sk->sk_prot = psock->sk_proto; > psock->sk_proto = NULL; > } > > + > + if (psock->icsk_af_ops) { > + icsk->icsk_af_ops = psock->icsk_af_ops; > + psock->icsk_af_ops = NULL; > + } > } > > In restore case either psock->icsk_af_ops is null or not. If its > null below code catches it. If its not null (from init path) then > we have a valid pointer. > > rcu_read_lock(); > psock = sk_psock(sk); > if (likely(psock)) { > proto = psock->sk_proto; > write_space = psock->saved_write_space; > ops = psock->icsk_af_ops; > } > rcu_read_unlock(); > > if (!ops) > ops = inet_csk(sk)->icsk_af_ops; > child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req); > > > We should do this with proper READ_ONCE/WRITE_ONCE to make it clear > what is going on and to stop compiler from breaking these assumptions. I > was going to generate that patch after this series but can do it before > as well. I didn't mention it here because it seems a bit out of scope > for this series because its mostly a fix to older code. +1, looking forward to your patch. Also, as I've recently learned, that should enable KTSAN to reason about the psock code [0]. > Also I started to think that write_space might be out of sync with ops but > it seems we never actually remove psock_write_space until after > rcu grace period so that should be OK as well and always point to the > previous write_space. > > Finally I wondered if we could remove the ops and then add it back > quickly which seems at least in theory possible, but that would get > hit with a grace period because we can't have conflicting psock > definitions on the same sock. So expanding the rcu block to include > the ops = inet_csk(sk)->icsk_af_ops would fix that case. I see, ops = inet_csk(sk)->icsk_af_ops might read out a re-overwritten ops after sock_map_unlink, followed by sock_map_link. Ouch. > So in summary I think we should expand the rcu lock here to include the > ops = inet_csk(sk)->icsk_af_ops to ensure we dont race with tear > down and create. I'll push the necessary update with WRITE_ONCE and > READ_ONCE to fix that up. Seeing we have to wait until the merge > window opens most likely anyways I'll send those out sooner rather > then later and this series can add the proper annotations as well. Or I could leave psock->icsk_af_ops set in restore_proto, like we do for write_space as you noted. Restoring it twice doesn't seem harmful, it has no side-effects. Less state changes to think about? I'll still have to apply what you suggest for saving psock->sk_proto, though. Thanks, Jakub [0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE