Hi - many thanks for the response; comments are inlined:
Leslie Rhorer wrote:
I've got a dedicated 1000Mbps link between two sites with a rtt of 7ms,
which seems to be dropping about 1 in 20000 packets (MTU of 1500 bytes).
I've got identical boxes at either end of the link running 2.6.27
(e1000e 0.3.3.3-k6), and I've been trying to saturate the link with TCP
transfers in spite of the packet loss.
Why?
Because I'd like to use existing TCP applications over the link (e.g.
rsync, mysql, HTTP, ssh, etc.) and get the highest possible throughput.
I can chuck UDP at near-linespeed over the link (/dev/zero + nc), which
seems to almost saturate it at 920Mbps. However, TCP throughput of a
single stream (/dev/zero + nc) averages about 150Mbps. Looking at the
tcptrace time sequence graphs of a capture, the TCP window averages out
at about 3MB - although after an initial exponential ramp up, the moment
the sender realises a packet is lost, the throughput appears to be
clamped to only about 5% of the available window. I assume this is
the congestion control algorithm at the sender applying a congestion
window.
No, not really, per se. TCP sends packets until the Tx window is
full. The Rx host receives the packets and assembles them in order. It
sends an ACK pointing to the highest numbered packet in the successfully
assembled stream, saving but ignoring any out-of-sequence packets. Thus, if
the receiving host gets the first 12 and the last 6 out of 20 packets, it
sends an ACK for packet #12, and then just waits. Having received an ACK
for #12, the Tx host moves the start of the window to packet #13, and
transmits the remaining packets up to the end of the window. It then sits
and waits for an additional ACK. Since packets #13 and #14 never reached
the Rx host, it also simply waits, keeping packets #15 through the end of
the window, and both hosts sit idle. After an implementation-dependent wait
period (usually about 2 seconds), the Tx host starts re-sending the entire
window contents, which in this case starts with packet #13.
Right, I follow your example - but I thought that with SACK turned on
(as it is by default), the Rx host will immediately send ACKs when
receiving packets #14 through #20, repeatedly ACKing receipt up to the
beginning of packet #13 - but with selective ACK blocks to announce
that it has correctly received subsequent packets. Once the Tx host
sees three such repeats, it can assume that packet #13 was lost, and
retransmit it - which surely only takes 1 round trip + 3 more packet
intervals to happen, rather than the 2 seconds of a plain old
retransmit? Even without SACK, doesn't Linux implement Fast Retransmit
and cause the Tx host to immediately retransmit packet #13 after
receiving 3 consecutive duplicate ACKs?
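(As an aside, a quick way to double-check that SACK and friends really
are enabled on both boxes, and which congestion control algorithm the
sender is using, is something like the following - these are just the
stock sysctl names, nothing exotic:

  # should all be enabled by default on 2.6.27
  sysctl net.ipv4.tcp_sack net.ipv4.tcp_fack net.ipv4.tcp_dsack
  # the algorithm doing the post-loss clamping
  sysctl net.ipv4.tcp_congestion_control

If those come back enabled - and they are by default - then SACK-based
recovery and fast retransmit should both be in play here.)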
This entire process seems like it should be able to happen without
causing enormous disruption, and whilst the window might be briefly
blocked waiting for retransmission, it should not significantly hinder
throughput. That said, with a 7ms round trip and 1500-byte packets at
1Gbps line speed (roughly 83,000 packets per second), a 1-in-20000
(0.005%) loss rate uniformly distributed would mean about 4 lost
packets every second - and if each loss stalls the stream for roughly
one RTT, that is only ~30ms of retransmit pauses per second. According
to tcptrace, the loss is in fact clumpy, and I see only a few ~7ms
pauses every second.
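As a rough cross-check, the usual loss-based throughput estimate
(Mathis et al., rate ~ MSS / (RTT * sqrt(p)), give or take a constant
close to 1) gives, for an MSS of 1448 bytes (1500 minus IP/TCP headers
and timestamps), a 7ms RTT and p = 1/20000:

  1448 / (0.007 * sqrt(0.00005)) = ~29 MB/s, i.e. roughly 230 Mbit/s

which is the same order of magnitude as the ~150Mbps I actually see
and nowhere near line rate - so the loss rate by itself looks
sufficient to explain the poor single-stream throughput.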
If a re-transmit is required, then TCP does adjust the window size
to accommodate what it presumes is congestion on the link. It also never
starts out streaming at full bandwidth. It continually adjusts its window
size upwards until it encounters what it interprets as congestion, or
reaches the maximum window size supported by the two hosts.
Right. I understand this as the congestion avoidance and slow start
algorithms from RFC2581.
What else should I be doing to crank up the throughput and defeat the
congestion control?
Why would you be trying to do this?
To get the most throughput out of the link for TCP transfers between
existing applications.
It is true TCP works well with
congested links, but not so well with links suffering random errors. You
aren't going to be successful in breaking the TCP handshaking parameters
without breaking TCP itself.
Right. I'm not trying to break the handshaking parameters - just adjust
the extent to which the congestion window is reduced in the face of
packet loss, admittedly at the risk of increasing packet loss when the
link is genuinely saturated.
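The least invasive way I can see to do that without patching the
kernel is simply to try a different congestion control module, which
can be switched at runtime - roughly like this, assuming the modules
are built for this kernel:

  # what is in use now, and what else is available
  sysctl net.ipv4.tcp_congestion_control
  sysctl net.ipv4.tcp_available_congestion_control
  # try e.g. H-TCP, which is meant to be less timid on
  # high bandwidth-delay-product paths
  modprobe tcp_htcp
  sysctl -w net.ipv4.tcp_congestion_control=htcp

Westwood+ (tcp_westwood) is another candidate, since it is supposed to
cope better with non-congestion loss, though I haven't measured how
much difference either makes on this link.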
TCP guarantees delivery of the packets to the
application layer intact and in order. The behavior of TCP on a dirty link
is an artifact of that requirement. If you want to deliver at full speed,
use UDP, and have the application layer handle lost packets.
Surely implementing reliable data transfer at the application level ends
up being effectively the same as re-implementing TCP (although I guess
you could leave out the congestion control, or find some
application-layer mechanism for reserving bandwidth for the stream).
If you did not
write the application (or have a developer do it for you), and it does not
support UDP transfers, then there is nothing you can do about it.
Okay.
Could jumbo frames help?
No. If anything, they may make it worse. Noisy links call for
small frames.
I'm trying jumbo frames anyway - in the hope that if the loss is
happening per-packet, at least the congestion window will increase more
rapidly after it collapses upon packet loss (as implied by
http://sd.wareonearth.com/~phil/jumbo.html). If the loss is happening
per-bit, then it will make the packet loss appear bad enough that I
stand a chance of getting the link itself fixed :)
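For the record, the jumbo frame experiment is just the obvious thing,
done on both hosts (eth0 here is an assumption - substitute the real
interface), plus a check that the path genuinely carries 9000-byte
frames end to end:

  ip link set dev eth0 mtu 9000
  # 8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header
  ping -M do -s 8972 <remote host>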
is this a Bad Idea?
Yes. RUDE_TCP notwithstanding, there are various ways to guarantee
data delivery other than the one used by TCP, and each method has its own
strengths and drawbacks. No matter what transfer protocol is implemented,
however, guaranteeing delivery of a stream segment requires the entire
segment be assembled completely at the Rx host before moving on.
Agreed.
Consequently, once the entire segment has been transmitted, the process
must halt in some fashion until the Tx host receives notification that
the entire segment was received intact.
This places an upper limit on the overall transmission rate directly
proportional to the size of the Rx buffer.
Yes, but I really don't think that this is what is slowing my
throughput down in this instance - the bandwidth-delay product here is
only about 1Gbps x 7ms = ~875KB, comfortably inside the ~3MB window on
offer. Instead, the congestion window is clamping the data rate at the
sender. Looking at a tcptrace time sequence graph, I can see that only
a small fraction of the available TCP window is ever used - and I can
only conclude that the Tx host is holding off on sending because it is
adhering to the artificially reduced congestion window.
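In case it helps anyone reproduce this, the graphs come from nothing
fancier than a header capture on the sender run through tcptrace
(eth0 and port 5001 here stand in for the real interface and port):

  tcpdump -i eth0 -s 96 -w transfer.pcap port 5001
  # -G produces the time sequence graphs, among others, as .xpl files
  tcptrace -G transfer.pcap
  xplot *.xpl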
This can be done at the
application layer, or it can be done at some other layer, in this case TCP.
Handling a link that is expected to be noisy is definitely best done
with some protocol other than TCP, assuming such flexibility is available.
Presumably a rather perverse solution to this would be a proxy to split
a single TCP stream into multiple streams, and then reassemble them at
the other end - thus pushing the problem into one of having large Rx
application-layer buffers for reassembly, using socket back-pressure to
keep the component TCP streams sufficiently in step. Does anyone know
if anyone's written such a thing? Or do people simply write off TCP
single stream throughput through WANs like this?
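A cruder check of the same idea: if several parallel TCP streams
between the same two boxes recover most of the bandwidth, a
split-and-reassemble proxy would at least be attacking the right
problem. iperf makes this easy to try (stream count and duration here
are arbitrary):

  # receiving end
  iperf -s
  # sending end: 8 parallel TCP streams for 60 seconds
  iperf -c <remote host> -P 8 -t 60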
M.