I have recently done some extensive analysis of the performance of a TCP stream being used by a third party application running on two Egenera servers. These servers are conventional 2- or 4-way SMP, high-speed (2 to 3 GHz) P4s, interconnected with an atypical, high-speed network, running relatively modern (2.4.18 or 2.4.21) Linux kernels.
This investigation has revealed three or so issues in the Linux networking code that can be problematic for (non-TCP-offloading) high-speed, low latency, large MTU networks. I will be starting threads addressing each of these areas; this message starts a thread to discuss the congestion avoidance algorithm (RFC 2581) used in the Linux TCP implementation.
The observed behavior of this application was that its data transfer rates were lower on the Egenera hardware than on comparable whiteboxes with gigabit ethernet. This is inconsistent with more usual applications such as scp, ftp, nfs, etc., which report data rates on Egenera hardware that are faster by roughly the ratio of the two networks' physical rates (gigabit ethernet vs Egenera's high-speed network).
The difference between the usual applications (scp, ftp, nfs, etc.) and this third party application is that the third party application relies on a non-trivial bi-directional conversation during its data transfers, as opposed to the usual applications, which use uni-directional data streams. This third party application therefore exercises some TCP algorithms (delayed ack, quick ack, congestion avoidance, etc.) in atypical ways (at least with respect to the "usual applications").
Detailed analysis of TCP packets and the Linux TCP implementation showed that the sender congestion window was being opened very slowly on Egenera hardware, artificially limiting TCP throughput. The cause of the slow expansion of the sender congestion window is a combination of two factors: very low network latency, and the following conditional in tcp_ack() in net/ipv4/tcp_input.c: prior_in_flight >= tp->snd_cwnd
This conditional prevents the congestion avoidance code from executing if the number of outstanding packets is less than the sender's congestion window. However, on very low latency networks acks return quickly, keeping prior_in_flight very low. This would normally not cause a problem, since quick-returning acks would leave the sending TCP unconstrained (it would always observe available room both from the receiver window's point of view and from the congestion window's point of view).
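To illustrate the effect of that guard, here is a small user-space model (my own simplification, not the kernel code) comparing window growth with and without the check. It assumes slow start grows the window by one segment per ack, and that the low-latency network keeps only one segment in flight at the moment each ack arrives:

#include <stdio.h>

/* Toy model of the "prior_in_flight >= tp->snd_cwnd" guard; the names
 * and numbers below are illustrative, not taken from the kernel. */
static unsigned int grow_cwnd(unsigned int cwnd, unsigned int prior_in_flight,
			      int guarded)
{
	if (guarded && prior_in_flight < cwnd)
		return cwnd;		/* guard suppresses the increase */
	return cwnd + 1;		/* slow start: one segment per ack */
}

int main(void)
{
	unsigned int cwnd_guarded = 2, cwnd_open = 2;	/* default initial cwnd */
	unsigned int prior_in_flight = 1;		/* acks return quickly */
	int ack;

	for (ack = 1; ack <= 10; ack++) {
		cwnd_guarded = grow_cwnd(cwnd_guarded, prior_in_flight, 1);
		cwnd_open = grow_cwnd(cwnd_open, prior_in_flight, 0);
		printf("ack %2d: cwnd with guard = %u, without guard = %u\n",
		       ack, cwnd_guarded, cwnd_open);
	}
	return 0;
}

With the guard in place the window never leaves its initial value of 2; without it the window opens by one segment per ack, as I would expect during slow start.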
The problem occurs when the TCP application is initially very conversational (lots of little application messages go back and forth), but then switches to a traffic pattern where bursts of little application messages go one way, followed synchronously by bursts of very large application messages the other way. This, combined with the application's disabling of the Nagle algorithm, results in TCP stream stalls waiting for delayed acks, which are caused by the small congestion window. At that point the congestion window does slowly open, but only by one packet per delayed ack, which kills the application's data transfer rate unless the cost is amortized over extremely long data transfers.
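For reference, the application disables Nagle with the standard TCP_NODELAY socket option. The snippet below is just the usual idiom, not the third party application's actual code:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Usual way to disable the Nagle algorithm on a connected socket. */
int disable_nagle(int sockfd)
{
	int one = 1;

	/* With TCP_NODELAY set, each small application message is sent
	 * immediately instead of being coalesced while waiting for
	 * outstanding data to be acked. */
	return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}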
An example will probably make this clearer. In the following, TCP1 is the receiver of large amounts of application data and the sender of the little application messages, while TCP2 is the sender of large amounts of application data and the receiver of the little application messages.
- Initially, the TCP1 application and TCP2 application send lots of little messages back and forth. Due to piggy-back acks and low latency, TCP1 never observes packets_in_flight greater than its sender congestion window, so its sender congestion window stays small.
- The TCP1 application then sends lots of little messages to the TCP2 application, using nonagle. Due to the small congestion window, TCP1 only sends several packets before waiting for an ack.
- The TCP2 application will not send any data back until it receives many of the little messages, so TCP2 holds off acking via the delayed ack mechanism, eventually acking the first couple of little packets.
- TCP1 gets the ack, increases its sender congestion window, and sends the rest of the TCP1 application's little data messages (which usually have been merged into a single packet).
- The TCP2 application now sends lots of large messages.
- Repeat the previous 4 steps.
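A back-of-the-envelope model of the cost of this cycle (my own illustration, with assumed numbers for the burst size and the delayed-ack timer) shows why the transfer rate suffers unless the bursts are very long:

#include <stdio.h>

int main(void)
{
	unsigned int segments = 64;	/* packets in one burst of little messages (assumed) */
	unsigned int cwnd = 2;		/* default initial congestion window */
	unsigned int delack_ms = 200;	/* worst-case delayed-ack timer (assumed) */
	unsigned int stalls = 0;

	while (segments > 0) {
		segments -= (segments > cwnd) ? cwnd : segments;
		if (segments > 0) {
			stalls++;	/* wait for a delayed ack */
			cwnd++;		/* window opens by one per ack */
		}
	}
	printf("burst completes after %u delayed-ack stalls (up to %u ms)\n",
	       stalls, stalls * delack_ms);
	return 0;
}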
By removing that one conditional in tcp_ack() mentioned above, the sender congestion window in TCP1 is immediately increased well above the default of 2 packets during the initial step, resulting in no delayed-ack stalls.
After reading over RFC 2581 and other TCP-related RFCs (793, 2861, 1323, 1337), I can find no explanation for the limit placed on the execution of the congestion avoidance algorithm by that one conditional in tcp_ack(). My interpretation of RFC 2581 is that during slow start (the initial congestion avoidance algorithm state), the sender congestion window should be increased by a packet at every received ack. This should continue until congestion is observed or the sender congestion window exceeds the slow start threshold, at which point the congestion avoidance algorithm should enter the congestion avoidance state.
During the congestion avoidance state, the RFC does recommend tying the increase to the amount of outstanding data (roughly one segment per full window of acks). However, this already appears to be implemented in tcp_cong_avoid() via the snd_cwnd_cnt variable.
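To make my reading of RFC 2581 concrete, here is a sketch of the window growth rules as I understand them, mirroring the snd_cwnd_cnt packet-counting approach. This is a paraphrase, not a copy of tcp_cong_avoid(), and the struct below is my own:

/* Window state, in whole segments; the field names merely echo the
 * kernel's, the struct itself is illustrative. */
struct cwnd_state {
	unsigned int snd_cwnd;		/* congestion window */
	unsigned int snd_ssthresh;	/* slow start threshold */
	unsigned int snd_cwnd_cnt;	/* acks counted toward the next increase */
};

static void on_ack(struct cwnd_state *s)
{
	if (s->snd_cwnd < s->snd_ssthresh) {
		/* Slow start: one segment per ack, with no
		 * outstanding-packet test. */
		s->snd_cwnd++;
	} else {
		/* Congestion avoidance: roughly one segment per full
		 * window of acks, which is what snd_cwnd_cnt provides. */
		if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
			s->snd_cwnd++;
			s->snd_cwnd_cnt = 0;
		}
	}
}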
So, I am trying to grasp the reasoning behind the tcp_ack() conditional mentioned above, and I recommend removing it. Removing the conditional allows the sender congestion window to increase far more quickly, and to grow larger than it currently does, but I believe that is the intent: the window should keep increasing until it hits ssthresh or congestion is detected.
Any insight on this topic would be appreciated.
Ted Duffy
tedward@egenera.com
Egenera, Inc.