On Thu, Dec 10, 2020 at 01:57:19AM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> > > > I think it is a bit complex to pass the new listener from
> > > > reuseport_detach_sock() to inet_csk_listen_stop().
> > > >
> > > > __tcp_close/tcp_disconnect/tcp_abort
> > > >  |-tcp_set_state
> > > >  |  |-unhash
> > > >  |  |-reuseport_detach_sock (return nsk)
> > > >  |-inet_csk_listen_stop
> > > Picking the new listener does not have to be done in
> > > reuseport_detach_sock().
> > >
> > > IIUC, it is done there only because it prefers to pick
> > > the last sk from socks[] when bpf prog is not attached.
> > > This seems to get in the way of exploring other potential
> > > implementation options.
> > Yes.
> > This is just an idea, but we can reserve the last index of socks[] to hold the
> > last 'moved' socket in reuseport_detach_sock() and use it in
> > inet_csk_listen_stop().
> >
> > > Merging the discussion on the last socks[] pick from another thread:
> > > > I think most applications start new listeners before closing listeners, in
> > > > this case, selecting the moved socket as the new listener works well.
> > > >
> > > > > That said, if it is still desired to do a random pick by kernel when
> > > > > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > > > > suggested in another reply. To keep it simple, I would also keep this
> > > > > kernel-pick consistent instead of the request socket doing something
> > > > > different from the unhash path.
> > > >
> > > > Then, is this way better to keep the kernel-pick consistent?
> > > >
> > > > 1. call reuseport_select_migrated_sock() without sk_hash from any path
> > > > 2. generate a random number in reuseport_select_migrated_sock()
> > > > 3. pass it to __reuseport_select_sock() only for select-by-hash
> > > > (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> > > > 5. do migration per queue in inet_csk_listen_stop() or per request in
> > > >    the receive path.
> > > >
> > > > I understand it is beautiful to keep consistency, but I also think
> > > > the kernel-pick with heuristic performs better than random-pick.
> > > I think discussing the best kernel pick without explicit user input
> > > is going to be a dead end. There is always a case that
> > > makes this heuristic (or guess) fail. e.g. what if multiple
> > > sk(s) being closed are always the last one in the socks[]?
> > > all their child sk(s) will then be piled up at one listen sk
> > > because the last socks[] is always picked?
> >
> > There can be such a case, but it means the newly listening sockets are
> > closed earlier than the old ones.
> >
> > > Let's assume the last socks[] is indeed the best for all cases. Then why
> > > doesn't the in-progress req pick it this way? I feel the implementation
> > > is doing what is convenient at that point. And that is fine, I think
> >
> > In this patchset, I originally assumed four things:
> >
> > migration should be done
> >   (i)   from old to new
> >   (ii)  to redistribute requests as evenly as possible
> >   (iii) to keep the order of requests in the queue
> >         (resulting in splicing queues)
> >   (iv)  in O(1) for scalability
> >         (resulting in the fix-up rsk_listener approach)
> >
> > I selected the last socket in the unhash path to satisfy the above four
> > because the last socket changes at every close() syscall if the application
> > closes the older sockets first.
> >
> > But when receiving the ACK or retransmitting the SYN+ACK, we cannot get the
> > last 'moved' socket.
> > Even if we reserve the last 'moved' socket in the last
> > index by the idea above, we cannot be sure the last socket has changed after
> > close() for each req->listener. For example, we have listeners A, B, C, and
> > D, and then call close(A) and close(B), and receive the final ACKs for A
> > and B, then both of them are assigned to C. In this case, A's requests going
> > to D and B's going to C is desired. So, selecting the last socket in socks[]
> > for incoming requests cannot realize (ii).
> >
> > This is why I selected the last moved socket in the unhash path and a random
> > listener in the receive path.
> >
> > > for kernel-pick, it should just go for simplicity and stay with
> > > the random(/hash) pick instead of pretending the kernel knows the
> > > application must operate in a certain way. It is fine
> > > that the pick was wrong, the kernel will eventually move the
> > > children/reqs to the surviving listen sk.
> >
> > Exactly. Also, the heuristic way is not fair for every application.
> >
> > After reading the below idea (migrated_sk), I think random-pick is better
> > for simplicity and for passing each sk.
> >
> > > [ I still think the kernel should not even pick if
> > > there is no bpf prog to instruct how to pick
> > > but I am fine as long as there is a sysctl to
> > > guard this. ]
> >
> > Unless different applications listen on the same port, random-pick can save
> > connections which would be aborted. So, I would add a sysctl to do
> > migration when no eBPF prog is attached.
> >
> > > I would rather focus on ensuring the bpf prog gets what it
> > > needs to make the migration pick. A few things
> > > I would like to discuss and explore:
> >
> > If we splice requests like this, we do not need the double lock?
> >
> > > > 1. lock the accept queue of the old listener
> > > > 2. unlink all requests and decrement refcount
> > > > 3. unlock
> > > > 4. update all requests with new listener
> > > I guess updating rsk_listener can be done without acquiring
> > > the lock in (5) below because it is done under the
> > > listening_hash's bucket lock (and also the global reuseport_lock) so
> > > that the new listener will stay in TCP_LISTEN state?
> >
> > If we do migration in inet_unhash(), the lock is held, but it is not held
> > in inet_csk_listen_stop().
> >
> > > I am not sure iterating the queue under these
> > > locks is a very good thing to do though. The queue may not be
> > > very long in a usual setup but still let's see
> > > if that can be avoided.
> >
> > I agree, the lock should not be held for long.
> >
> > > Do you think the iteration can be done without holding the
> > > bucket lock and the global reuseport_lock? inet_csk_reqsk_queue_add()
> > > takes the rskq_lock and then checks for TCP_LISTEN. Maybe
> > > something similar can be done here also?
> >
> > I think at least either one is necessary, so if the sk_state of the selected
> > listener is TCP_CLOSE (this is mostly with the kernel's random-pick), then we
> > have to fall back to calling inet_child_forget().
> >
> > > While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
> > > the bpf prog can pick per req and have the sk_hash.
> > > However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
> > > the bpf prog currently does not have a chance to
> > > pick individually for each req/child on the queue.
> > > Since it is iterating the queue anyway, does it make
> > > sense to also call the bpf prog to pick for each req/child
> > > in the queue? It then can pass sk_hash (from child->sk_hash?)
> > > to the bpf prog also instead of the current 0.
> > > The cost of calling the bpf prog is not really that much /
> > > significant in the migration code path. If the queue is somehow
> > > unusually long, there is already an existing
> > > cond_resched() in inet_csk_listen_stop().
> > >
> > > Then, instead of adding sk_reuseport_md->migration,
> > > it can then add sk_reuseport_md->migrate_sk.
> > > "migrate_sk = req" for an in-progress req and "migrate_sk = child"
> > > when iterating the acceptq. The bpf_prog can then tell what sk (req or child)
> > > it is migrating by reading migrate_sk->state. It can then also
> > > learn the 4-tuple (src/dst ip/port) while the skb is missing.
> > > The sk_reuseport_md->sk can still point to the closed sk
> > > such that the bpf prog can learn the cookie.
> > >
> > > I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
> > > by doing the above. It also gives a more consistent
> > > interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.
> > I think this is a really nice idea. Also, I tried to implement random-pick
> > one by one in inet_csk_listen_stop() yesterday, and I found a concern about
> > how to handle requests in the TFO queue.
> >
> > The request can already be accepted, so passing it to the eBPF prog is
> > confusing? But, redistributing randomly can affect all listeners
> > unnecessarily. How should we handle such requests?
> I've implemented one-by-one migration only for the accept queue for now.
> In addition to the concern about the TFO queue,
You meant this queue: queue->fastopenq.rskq_rst_head?
Can "req" be passed?  I did not look into the lock/race details
for that though.

> I want to discuss whether we should
> pass NULL or the request_sock to the eBPF program as migrate_sk when
> selecting a listener for a SYN?
hmmm... not sure I understand your question.  You meant the existing
lookup listener case from inet_lhash2_lookup()?  There is nothing
to migrate at that point, so NULL makes sense to me.  migrate_sk's type
should be PTR_TO_SOCK_COMMON_OR_NULL.
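
To make that a bit more concrete, here is roughly what I am picturing for
the prog side.  This is only a sketch: "migrate_sk" is the field proposed
above (it does not exist in the current uapi), and the map/key handling is
made up purely for illustration:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
        __uint(max_entries, 16);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u64));
} new_listeners SEC(".maps");

SEC("sk_reuseport")
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
        __u32 key;

        if (!reuse_md->migrate_sk) {
                /* Normal SYN lookup: the skb and hash are available as today. */
                key = reuse_md->hash % 16;
        } else {
                /* Migration: no skb.  The req-vs-child state and the 4-tuple
                 * would be read from migrate_sk instead.
                 */
                key = 0;
        }

        if (bpf_sk_select_reuseport(reuse_md, &new_listeners, &key, 0))
                return SK_DROP;

        return SK_PASS;
}

char _license[] SEC("license") = "GPL";

One prog would then cover both the SYN lookup and the migration cases,
which is the folding of MIGRATE_QUEUE/MIGRATE_REQUEST mentioned above.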
>
> ---8<---
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index a82fd4c912be..d0ddd3cb988b 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  }
>  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>
> +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
> +{
> +        struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
> +        bool migrated = false;
> +
> +        spin_lock(&queue->rskq_lock);
> +        if (likely(nsk->sk_state == TCP_LISTEN)) {
> +                migrated = true;
> +
> +                req->dl_next = NULL;
> +                if (queue->rskq_accept_head == NULL)
> +                        WRITE_ONCE(queue->rskq_accept_head, req);
> +                else
> +                        queue->rskq_accept_tail->dl_next = req;
> +                queue->rskq_accept_tail = req;
> +                sk_acceptq_added(nsk);
> +                inet_csk_reqsk_queue_migrated(sk, nsk, req);
need to first resolve the question raised in patch 5 regarding the update
on req->rsk_listener though.

> +        }
> +        spin_unlock(&queue->rskq_lock);
> +
> +        return migrated;
> +}
> +
>  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
>                                           struct request_sock *req, bool own_req)
>  {
> @@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
>   */
>  void inet_csk_listen_stop(struct sock *sk)
>  {
> +        struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
>          struct inet_connection_sock *icsk = inet_csk(sk);
>          struct request_sock_queue *queue = &icsk->icsk_accept_queue;
>          struct request_sock *next, *req;
> +        struct sock *nsk;
>
>          /* Following specs, it would be better either to send FIN
>           * (and enter FIN-WAIT-1, it is normal close)
> @@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
>                  WARN_ON(sock_owned_by_user(child));
>                  sock_hold(child);
>
> +                if (reuseport_cb) {
> +                        nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, NULL);
> +                        if (nsk) {
> +                                if (inet_csk_reqsk_queue_migrate(sk, nsk, req))
> +                                        goto unlock_sock;
> +                                else
> +                                        sock_put(nsk);
> +                        }
> +                }
> +
>                  inet_child_forget(sk, req, child);
>                  reqsk_put(req);
> +unlock_sock:
>                  bh_unlock_sock(child);
>                  local_bh_enable();
>                  sock_put(child);
> ---8<---
>
> > > > 5. lock the accept queue of the new listener
> > > > 6. splice requests and increment refcount
> > > > 7. unlock
> > > >
> > > > Also, I think splicing is better to keep the order of requests. Adding one
> > > > by one reverses it.
>
It can keep the order, but I think it is orthogonal here.
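
One more note on the ordering point: if the req(s) are popped from the old
queue's head and appended to the new queue's tail, one-by-one migration
also keeps the original order.  A rough sketch (the helper name is made up),
ignoring the rskq_lock and the acceptq accounting that
reqsk_queue_remove()/inet_csk_reqsk_queue_add() normally take care of:

static void migrate_acceptq_in_order(struct request_sock_queue *old_q,
                                     struct request_sock_queue *new_q)
{
        struct request_sock *req;

        while ((req = old_q->rskq_accept_head) != NULL) {
                /* pop from the head of the old (closing) listener */
                old_q->rskq_accept_head = req->dl_next;

                /* append to the tail of the new listener */
                req->dl_next = NULL;
                if (!new_q->rskq_accept_head)
                        new_q->rskq_accept_head = req;
                else
                        new_q->rskq_accept_tail->dl_next = req;
                new_q->rskq_accept_tail = req;
        }
        old_q->rskq_accept_tail = NULL;
}

The FIFO order is preserved either way, so splicing vs one-by-one can be
decided by the locking and bpf considerations above rather than by the
ordering.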