Re: Sender-side SWS avoidance in tcp_sendmsg()

"David S. Miller" <davem@redhat.com> · Tue, 6 Jul 2004 18:47:31 -0700

On Tue, 6 Jul 2004 10:05:49 +0530 (IST)
Nagendra Singh Tomar <nagendra_tomar@adaptec.com> wrote:

> So, are we not doing this check incorrectly, since we are just considering 
> the length of skb2 to calculate the amount of to-be-sent data, while we 
> infact we have skb1->len+skb2->len worth of data to be sent.

Yes, this is weird.  The SWS avoidance implementation is split into
two parts:

1) during packetization, which is what we're discussing here

2) in tcp_snd_test()

It is instructive, I guess, to look at how 2.2.x tcp_sendmsg()
implemented #1.  The relative bits are (here it is deciding how
much data to copy into the SKB we allocate, in the variable
'copy', and 'tmp' determines how large an SKB to allocate):

			psh = 0;
			copy = tp->snd_wnd - (tp->snd_nxt - tp->snd_una);

This is the "usable window", ie. the equation from
RFC1122 section 4.2.3.4, page 99:

	U = SND.UNA + SND.WND - SND.NXT

Now comes the "half maximum window" test:

			if(copy > (tp->max_window >> 1)) {
				copy = min(copy, mss_now);
				psh = 1;
			} else {
				copy = mss_now;
			}
			if(copy > seglen)
				copy = seglen;

'seglen' is the length of the current iovec entry we're trying
to copy in from the user.

So now 'copy' is equal to "min(D,U)" from the RFC.

Next, we determine if we should "queue and accumulate more data"
or send it out now.

			/* Determine how large of a buffer to allocate.  */
			tmp = MAX_HEADER + sk->prot->max_header;
			if (copy < min(mss_now, tp->max_window >> 1) &&
			    !(flags & MSG_OOB)) {
				tmp += min(mss_now, tp->max_window);
				queue_it = 1;
			} else {
				tmp += copy;
				queue_it = 0;
			}

So we will queue unless we have min(MSS, MAX_WINDOW / 2)
bytes of data to put into this packet.

The PSH setting algorithm looks strange, PSH is set if:

1) The usable window is larger than half of MAX_WINDOW

2) We are finishing the sendmsg() call and copying in
   the final bit of the user's data.

#2 is fine, but #1 seems questionable.

Anyways, let's get back to the specific case of that weird
tcp_mark_push() call.  How can it actually trigger?

If the tail SKB was full MSS sized and filled, we'd first
trigger this test at the top of the loop:

			if (!sk->sk_send_head ||
			    (copy = mss_now - skb->len) <= 0) {

Which cases the "new_segment:" code to run.  This eliminates
that case.

When we don't have a NETIF_F_SG capable device, select_size()
always allocates SKB's at least MSS sized.  Else it just
allocates enough for the headers (because we will stick the
pages into the SKB frag array for the user data).

So let's look at how the tcp_mark_push() call in question can
_really_ happen:

			if (skb_tailroom(skb) > 0) {

This check must first fail.

				if (skb_can_coalesce(skb, i, page, off) &&
				    off != PAGE_SIZE) {

This check must also fail.

				} else if (i == MAX_SKB_FRAGS ||
					   (!i &&
					   !(sk->sk_route_caps & NETIF_F_SG))) {
And this test must pass.

These three preconditions mean:

1) There is no tail room.  If we just allocated the SKB, and
   we don't have a NETIF_F_SG device, we will always have
   tail room.

   Similarly, and existing skb at the tail of the queue,
   unless filled to MSS capacity already, will also have
   tail room in the non-NETIF_F_SG case.

2) We cannot coalesce the page.  This is only relevant in
   the NETIF_F_SG case.  nr_frags will never be non-zero
   in this code unless NETIF_F_SG is true.  And skb_can_coalesce()
   always returns false when nr_frags passed in ('i' here) is zero.

3) We filled up all the frags (which really means that MAX_SKB_FRAGS
   is set too low, or there is a bug in the coalescing algorithm here)
   or essentially we are in the non-NETIF_F_SG case.

This code path is rare and unusual.

It could occur because of some bizarre combination of sendfile()
and sendmsg() calls while TCP_CORK has been set on the socket.
You would need to sendfile a byte from MAX_SKB_FRAGS - 1 different
unique pages, then make a sendmsg() call before that packet gets
sent out.  That's beyond abnormal. :-)

As mentioned previously it could occur if MAX_SKB_FRAGS is too
tiny.  This would be a kernel bug we should fix.  I believe the
current setting is large enough.

For the non-NETIF_F_SG case, I can't see how in the world this
code path could ever run except in one case: the case where a
route change causes the output device to change and thus
NETIF_F_SG goes from on to off during packetization.

So I believe that the tcp_mark_push() invocation that actually
occurs in practice is the other one in this function:

		if (forced_push(tp)) {
			tcp_mark_push(tp, skb);
			__tcp_push_pending_frames(sk, tp, mss_now, TCP_NAGLE_PUSH);

And the one that you have issue with is really just an
attempt to reset the SWS and PSH setting state machine
when an unusual packetization failure occurs.

Comments?
-
: send the line "unsubscribe linux-net" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html