From: Eric Dumazet <eric.dumazet@xxxxxxxxx>
Date: Thu, 10 Jun 2021 22:36:27 +0200

> On 5/21/21 8:21 PM, Kuniyuki Iwashima wrote:
> > This patch also changes the code to call reuseport_migrate_sock() and
> > inet_reqsk_clone(), but unlike the other cases, we do not call
> > inet_reqsk_clone() right after reuseport_migrate_sock().
> >
> > Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
> > has three kinds of refcnt:
> >
> >   (A) for listener itself
> >   (B) carried by request_sock
> >   (C) sock_hold() in tcp_v[46]_rcv()
> >
> > While processing the req, (A) may disappear by close(listener). Also, (B)
> > can disappear by accept(listener) once we put the req into the accept
> > queue. So, we have to hold another refcnt (C) for the listener to prevent
> > use-after-free.
> >
> > For socket migration, we call reuseport_migrate_sock() to select a listener
> > with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
> > This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
> > Thus we have to take another refcnt (B) for the newly cloned request_sock.
> >
> > In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
> > try to put the new req into the accept queue. By migrating req after
> > winning the "own_req" race, we can avoid such a worst situation:
> >
> >   CPU 1 looks up req1
> >   CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
> >   CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
> >   ...
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@xxxxxxxxxxxx>
> > Acked-by: Martin KaFai Lau <kafai@xxxxxx>
> > ---
> >  net/ipv4/inet_connection_sock.c | 34 ++++++++++++++++++++++++++++++---
> >  net/ipv4/tcp_ipv4.c             | 20 +++++++++++++------
> >  net/ipv4/tcp_minisocks.c        |  4 ++--
> >  net/ipv6/tcp_ipv6.c             | 14 +++++++++++---
> >  4 files changed, 58 insertions(+), 14 deletions(-)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index c1f068464363..b795198f919a 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -1113,12 +1113,40 @@ struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> >                                           struct request_sock *req, bool own_req)
> >  {
> >          if (own_req) {
> > -                inet_csk_reqsk_queue_drop(sk, req);
> > -                reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
> > -                if (inet_csk_reqsk_queue_add(sk, req, child))
> > +                inet_csk_reqsk_queue_drop(req->rsk_listener, req);
> > +                reqsk_queue_removed(&inet_csk(req->rsk_listener)->icsk_accept_queue, req);
> > +
> > +                if (sk != req->rsk_listener) {
> > +                        /* another listening sk has been selected,
> > +                         * migrate the req to it.
> > +                         */
> > +                        struct request_sock *nreq;
> > +
> > +                        /* hold a refcnt for the nreq->rsk_listener
> > +                         * which is assigned in inet_reqsk_clone()
> > +                         */
> > +                        sock_hold(sk);
> > +                        nreq = inet_reqsk_clone(req, sk);
> > +                        if (!nreq) {
> > +                                inet_child_forget(sk, req, child);
>
> Don't you need a sock_put(sk) here ?

Yes. If nreq == NULL, inet_reqsk_clone() calls sock_put().

> > +                                goto child_put;
> > +                        }
> > +
> > +                        refcount_set(&nreq->rsk_refcnt, 1);
> > +                        if (inet_csk_reqsk_queue_add(sk, nreq, child)) {
> > +                                reqsk_migrate_reset(req);
> > +                                reqsk_put(req);
> > +                                return child;
> > +                        }
> > +
> > +                        reqsk_migrate_reset(nreq);
> > +                        __reqsk_free(nreq);
> > +                } else if (inet_csk_reqsk_queue_add(sk, req, child)) {
> >                          return child;
> > +                }
> >
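
To make the (A)/(B)/(C) refcnt description above easier to follow, here is a
rough, simplified sketch of the TCP_NEW_SYN_RECV receive path (illustrative
only, not the exact kernel code; variable names loosely follow tcp_v4_rcv()):

        /* tcp_v4_rcv(), TCP_NEW_SYN_RECV path -- simplified sketch */
        struct request_sock *req = inet_reqsk(sk);
        bool req_stolen = false;
        struct sock *nsk;

        sk = req->rsk_listener;         /* (B): refcnt carried by the req */

        sock_hold(sk);                  /* (C): extra refcnt on the listener, because
                                         * (A) can go away via close(listener) and (B)
                                         * via accept(listener) once the req is queued
                                         */

        nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
        /* ... process nsk / req ... */

        sock_put(sk);                   /* release (C) at the end of tcp_v4_rcv() */

With migration, the listener selected by reuseport_migrate_sock() is the one
whose refcnt plays the role of (C) here, which is why
inet_csk_complete_hashdance() has to take the extra refcnt (B) for the cloned
request_sock itself.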
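
And to spell out the answer to the sock_put() question: the refcnt taken by
sock_hold(sk) just before inet_reqsk_clone() is released inside
inet_reqsk_clone() itself when the allocation fails. Roughly (a simplified
sketch of the helper added earlier in this series; only the failure path is
shown, the rest is elided):

        static struct request_sock *inet_reqsk_clone(struct request_sock *req,
                                                     struct sock *sk)
        {
                struct request_sock *nreq;

                nreq = kmem_cache_alloc(req->rsk_ops->slab, GFP_ATOMIC | __GFP_NOWARN);
                if (!nreq) {
                        /* paired with the sock_hold(sk) done by the caller */
                        sock_put(sk);
                        return NULL;
                }

                /* ... copy req into nreq and set nreq->rsk_listener = sk ... */

                return nreq;
        }

So the !nreq branch in inet_csk_complete_hashdance() only has to undo the
child (inet_child_forget() and goto child_put); the listener refcnt has
already been dropped.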