On Tue, 6 Jul 2004 10:05:49 +0530 (IST) Nagendra Singh Tomar <nagendra_tomar@adaptec.com> wrote: > So, are we not doing this check incorrectly, since we are just considering > the length of skb2 to calculate the amount of to-be-sent data, while we > infact we have skb1->len+skb2->len worth of data to be sent. Yes, this is weird. The SWS avoidance implementation is split into two parts: 1) during packetization, which is what we're discussing here 2) in tcp_snd_test() It is instructive, I guess, to look at how 2.2.x tcp_sendmsg() implemented #1. The relative bits are (here it is deciding how much data to copy into the SKB we allocate, in the variable 'copy', and 'tmp' determines how large an SKB to allocate): psh = 0; copy = tp->snd_wnd - (tp->snd_nxt - tp->snd_una); This is the "usable window", ie. the equation from RFC1122 section 4.2.3.4, page 99: U = SND.UNA + SND.WND - SND.NXT Now comes the "half maximum window" test: if(copy > (tp->max_window >> 1)) { copy = min(copy, mss_now); psh = 1; } else { copy = mss_now; } if(copy > seglen) copy = seglen; 'seglen' is the length of the current iovec entry we're trying to copy in from the user. So now 'copy' is equal to "min(D,U)" from the RFC. Next, we determine if we should "queue and accumulate more data" or send it out now. /* Determine how large of a buffer to allocate. */ tmp = MAX_HEADER + sk->prot->max_header; if (copy < min(mss_now, tp->max_window >> 1) && !(flags & MSG_OOB)) { tmp += min(mss_now, tp->max_window); queue_it = 1; } else { tmp += copy; queue_it = 0; } So we will queue unless we have min(MSS, MAX_WINDOW / 2) bytes of data to put into this packet. The PSH setting algorithm looks strange, PSH is set if: 1) The usable window is larger than half of MAX_WINDOW 2) We are finishing the sendmsg() call and copying in the final bit of the user's data. #2 is fine, but #1 seems questionable. Anyways, let's get back to the specific case of that weird tcp_mark_push() call. How can it actually trigger? If the tail SKB was full MSS sized and filled, we'd first trigger this test at the top of the loop: if (!sk->sk_send_head || (copy = mss_now - skb->len) <= 0) { Which cases the "new_segment:" code to run. This eliminates that case. When we don't have a NETIF_F_SG capable device, select_size() always allocates SKB's at least MSS sized. Else it just allocates enough for the headers (because we will stick the pages into the SKB frag array for the user data). So let's look at how the tcp_mark_push() call in question can _really_ happen: if (skb_tailroom(skb) > 0) { This check must first fail. if (skb_can_coalesce(skb, i, page, off) && off != PAGE_SIZE) { This check must also fail. } else if (i == MAX_SKB_FRAGS || (!i && !(sk->sk_route_caps & NETIF_F_SG))) { And this test must pass. These three preconditions mean: 1) There is no tail room. If we just allocated the SKB, and we don't have a NETIF_F_SG device, we will always have tail room. Similarly, and existing skb at the tail of the queue, unless filled to MSS capacity already, will also have tail room in the non-NETIF_F_SG case. 2) We cannot coalesce the page. This is only relevant in the NETIF_F_SG case. nr_frags will never be non-zero in this code unless NETIF_F_SG is true. And skb_can_coalesce() always returns false when nr_frags passed in ('i' here) is zero. 3) We filled up all the frags (which really means that MAX_SKB_FRAGS is set too low, or there is a bug in the coalescing algorithm here) or essentially we are in the non-NETIF_F_SG case. This code path is rare and unusual. It could occur because of some bizarre combination of sendfile() and sendmsg() calls while TCP_CORK has been set on the socket. You would need to sendfile a byte from MAX_SKB_FRAGS - 1 different unique pages, then make a sendmsg() call before that packet gets sent out. That's beyond abnormal. :-) As mentioned previously it could occur if MAX_SKB_FRAGS is too tiny. This would be a kernel bug we should fix. I believe the current setting is large enough. For the non-NETIF_F_SG case, I can't see how in the world this code path could ever run except in one case: the case where a route change causes the output device to change and thus NETIF_F_SG goes from on to off during packetization. So I believe that the tcp_mark_push() invocation that actually occurs in practice is the other one in this function: if (forced_push(tp)) { tcp_mark_push(tp, skb); __tcp_push_pending_frames(sk, tp, mss_now, TCP_NAGLE_PUSH); And the one that you have issue with is really just an attempt to reset the SWS and PSH setting state machine when an unusual packetization failure occurs. Comments? - : send the line "unsubscribe linux-net" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html