I have recently done some extensive analysis of the performance of a TCP stream being used by a third party application running on two Egenera servers. These servers are conventional 2- or 4-way SMP, high-speed (2 to 3 GHz) P4s, interconnected with an atypical, high-speed network, running relatively modern (2.4.18 or 2.4.21) Linux kernels.
This investigation has revealed three or so issues in the Linux networking code that can be problematic for (non-TCP-offloading) high-speed, low latency, large MTU networks. I will be starting threads addressing each of these areas; this message starts a thread to discuss the congestion avoidance algorithm (RFC 2581) used in the Linux TCP implementation.
The observed behavior of this application was that its data transfer rates were lower on the Egenera hardware than on comparable whiteboxes with gigabit ethernet. This is inconsistent with more usual applications such as scp, ftp, nfs, etc., which report data rates on Egenera hardware that are faster by roughly the ratio of the two networks' physical rates (gigabit ethernet vs Egenera's high-speed network).
The difference between the usual applications (scp, ftp, nfs, etc.) and this third party application is that the third party application relies on a non-trivial bi-directional conversation during its data transfers, as opposed to the usual applications, which use uni-directional data streams. This third party application therefore exercises some TCP algorithms (delayed ack, quick ack, congestion avoidance, etc.) in atypical ways (at least with respect to the "usual applications").
Detailed analysis of TCP packets and the Linux TCP implementation showed that the sender congestion window was being opened very slowly on Egenera hardware, artificially limiting TCP throughput. The cause of the slow expansion of the sender congestion window is a combination of two factors: very low network latency, and the following conditional in tcp_ack() in net/ipv4/tcp_input.c: prior_in_flight >= tp->snd_cwnd
This conditional prevents the congestion avoidance code from executing if the number of outstanding packets is less than the sender's congestion window. However, on very low latency networks acks return quickly, keeping prior_in_flight very low. This would normally not cause a problem, since quick-returning acks would leave the sending TCP unconstrained (it would always observe available room both from the receiver window's point of view and from the congestion window's point of view).
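To illustrate the effect of that guard, here is a small user-space model (my own simplification, not the kernel code) comparing window growth with and without the check. It assumes slow start grows the window by one segment per ack, and that the low-latency network keeps only one segment in flight at the moment each ack arrives:

#include <stdio.h>

/* Toy model of the "prior_in_flight >= tp->snd_cwnd" guard; the names
 * and numbers below are illustrative, not taken from the kernel. */
static unsigned int grow_cwnd(unsigned int cwnd, unsigned int prior_in_flight,
			      int guarded)
{
	if (guarded && prior_in_flight < cwnd)
		return cwnd;		/* guard suppresses the increase */
	return cwnd + 1;		/* slow start: one segment per ack */
}

int main(void)
{
	unsigned int cwnd_guarded = 2, cwnd_open = 2;	/* default initial cwnd */
	unsigned int prior_in_flight = 1;		/* acks return quickly */
	int ack;

	for (ack = 1; ack <= 10; ack++) {
		cwnd_guarded = grow_cwnd(cwnd_guarded, prior_in_flight, 1);
		cwnd_open = grow_cwnd(cwnd_open, prior_in_flight, 0);
		printf("ack %2d: cwnd with guard = %u, without guard = %u\n",
		       ack, cwnd_guarded, cwnd_open);
	}
	return 0;
}

With the guard in place the window never leaves its initial value of 2; without it the window opens by one segment per ack, as I would expect during slow start.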
The problem occurs when the TCP application is initially very conversational (lots of little application messages go back and forth), but then switches to a traffic pattern where bursts of little application messages go one way, followed synchronously by bursts of very large application messages the other way. This, combined with the application's disabling of the Nagle algorithm, results in TCP stream stalls waiting for delayed acks, which are caused by the small congestion window. At that point the congestion window does slowly open, but only by one packet per delayed ack, which kills the application's data transfer rate unless the cost is amortized over extremely long data transfers.
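For reference, the application disables Nagle with the standard TCP_NODELAY socket option. The snippet below is just the usual idiom, not the third party application's actual code:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Usual way to disable the Nagle algorithm on a connected socket. */
int disable_nagle(int sockfd)
{
	int one = 1;

	/* With TCP_NODELAY set, each small application message is sent
	 * immediately instead of being coalesced while waiting for
	 * outstanding data to be acked. */
	return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}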
An example will probably make this clearer. In the following, TCP1 is the receiver of large amounts of application data and the sender of the little application messages, while TCP2 is the sender of large amounts of application data and the receiver of the little application messages.
- Initially, the TCP1 application and TCP2 application send lots of little messages back and forth. Due to piggy-back acks and low latency, TCP1 never observes packets_in_flight greater than its sender congestion window, so its sender congestion window stays small.
- The TCP1 application then sends lots of little messages to the TCP2 application, using nonagle. Due to the small congestion window, TCP1 only sends several packets before waiting for an ack.
- The TCP2 application will not send any data back until it receives many of the little messages, so TCP2 holds off acking via the delayed ack mechanism, eventually acking the first couple of little packets.
- TCP1 gets the ack, increases its sender congestion window, and sends the rest of the TCP1 application's little data messages (which usually have been merged into a single packet).
- The TCP2 application now sends lots of large messages.
- Repeat the previous 4 steps.
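A back-of-the-envelope model of the cost of this cycle (my own illustration, with assumed numbers for the burst size and the delayed-ack timer) shows why the transfer rate suffers unless the bursts are very long:

#include <stdio.h>

int main(void)
{
	unsigned int segments = 64;	/* packets in one burst of little messages (assumed) */
	unsigned int cwnd = 2;		/* default initial congestion window */
	unsigned int delack_ms = 200;	/* worst-case delayed-ack timer (assumed) */
	unsigned int stalls = 0;

	while (segments > 0) {
		segments -= (segments > cwnd) ? cwnd : segments;
		if (segments > 0) {
			stalls++;	/* wait for a delayed ack */
			cwnd++;		/* window opens by one per ack */
		}
	}
	printf("burst completes after %u delayed-ack stalls (up to %u ms)\n",
	       stalls, stalls * delack_ms);
	return 0;
}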
By removing that one conditional in tcp_ack() mentioned above, the sender congestion window in TCP1 is immediately increased well above the default of 2 packets during the initial step, resulting in no delayed-ack stalls.
After reading over RFC 2581 and other TCP-related RFCs (793, 2861, 1323, 1337), I can find no explanation for the limit placed on the execution of the congestion avoidance algorithm by that one conditional in tcp_ack(). My interpretation of RFC 2581 is that during slow start (the initial congestion avoidance algorithm state), the sender congestion window should be increased by a packet at every received ack. This should continue until congestion is observed or the sender congestion window exceeds the slow start threshold, at which point the congestion avoidance algorithm should enter the congestion avoidance state.
During the congestion avoidance state, the RFC does recommend tying the increase to the amount of outstanding data (roughly one segment per full window of acks). However, this already appears to be implemented in tcp_cong_avoid() via the snd_cwnd_cnt variable.
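To make my reading of RFC 2581 concrete, here is a sketch of the window growth rules as I understand them, mirroring the snd_cwnd_cnt packet-counting approach. This is a paraphrase, not a copy of tcp_cong_avoid(), and the struct below is my own:

/* Window state, in whole segments; the field names merely echo the
 * kernel's, the struct itself is illustrative. */
struct cwnd_state {
	unsigned int snd_cwnd;		/* congestion window */
	unsigned int snd_ssthresh;	/* slow start threshold */
	unsigned int snd_cwnd_cnt;	/* acks counted toward the next increase */
};

static void on_ack(struct cwnd_state *s)
{
	if (s->snd_cwnd < s->snd_ssthresh) {
		/* Slow start: one segment per ack, with no
		 * outstanding-packet test. */
		s->snd_cwnd++;
	} else {
		/* Congestion avoidance: roughly one segment per full
		 * window of acks, which is what snd_cwnd_cnt provides. */
		if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
			s->snd_cwnd++;
			s->snd_cwnd_cnt = 0;
		}
	}
}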
So, I am trying to grasp the reasoning behind the tcp_ack() conditional mentioned above, and I recommend removing it. Removing the conditional allows the sender congestion window to increase far more quickly, and to grow larger than it currently does, but I believe that is the intent: the window should keep increasing until it hits ssthresh or congestion is detected.
Any insight on this topic would be appreciated.
Ted Duffy
tedward@egenera.com
Egenera, Inc.