Oops. Sent the last one in HTML, so the mailing list rejected it.
Damned GUI email
clients!
Wei Yongjun wrote:
Doug Graham wrote:
On Fri, Jul 31, 2009 at 12:21:15PM +0800, Wei Yongjun wrote:
Doug Graham wrote:
13 2.002632 10.0.0.15 10.0.0.11 DATA (1452 bytes data)
14 2.203092 10.0.0.11 10.0.0.15 SACK
15 2.203153 10.0.0.15 10.0.0.11 DATA (2 bytes data)
16 2.203427 10.0.0.11 10.0.0.15 SACK
17 2.203808 10.0.0.11 10.0.0.15 DATA (1452 bytes data)
18 2.403524 10.0.0.15 10.0.0.11 SACK
19 2.403686 10.0.0.11 10.0.0.15 DATA (2 bytes data)
20 2.603285 10.0.0.15 10.0.0.11 SACK
What bothers me about this is that Nagle seems to be introducing a delay
here. The first DATA packets in both directions are MTU-sized packets,
yet both the Linux client and the BSD server wait 200ms until they get
the SACK to the first fragment before sending the second fragment.
The server can't send its reply until it gets both fragments, and the
client can't reassemble the reply until it gets both fragments, so from
the application's point of view, the reply doesn't arrive until 400ms
after the request is sent. This could probably be fixed by disabling
Nagle with SCTP_NODELAY, but that shouldn't be required. Nagle is only
supposed to prevent multiple outstanding *small* packets.
I think you hit the point which Nagle's algorithm should be not used.
Can you try the following patch?
[PATCH] sctp: do not used Nagle algorithm while fragmented data is transmitted
If fragmented data is sent, the Nagle's algorithm should not be
used. In special case, if only one large packet is sent, the delay
send of fragmented data will cause the receiver wait for more
fragmented data to reassembe them and not send SACK, but the sender
still wait for SACK before send the last fragment.
[patch deleted]
This patch seems to work quite well, but I think disabling Nagle
completely for large messages is not quite the right thing to do.
There's a draft-minshall-nagle-01.txt floating around that describes a
modified Nagle algorithm for TCP. It appears to have been implemented
in Linux TCP even though the draft has expired. The modified algorithm
is how I thought Nagle had always worked to begin with. From the draft:
"If a TCP has less than a full-sized packet to transmit,
and if any previously transmitted less than full-sized
packet has not yet been acknowledged, do not transmit
a packet."
so in the case of sending a fragmented SCTP message, all but the last
fragment will be full-sized and will be sent without delay. The last
fragment will usually not be full-sized, but it too will be sent without
delay because there are no outstanding non-full-sized packets.
The difference between this and your method is that yours would
allow many small fragments of big messages to be outstanding, whereas
this one would only allow the first big message to be sent in its
entirety, followed by the full-sized fragments of the next big
message. When it came time to send the second small fragment,
Nagle would force it to wait for an ACK for the first small fragment.
I'm not convinced that the difference is all that important,
but who knows.
Here's my attempt at implementing the modified Nagle algorithm described
in draft-minshall-nagle-01.txt. It should be applied instead of your
patch, not on top of it. If (q->outstanding_bytes % asoc->frag_point)
is zero, no delay is introduced. The assumption is that this means that
all outstanding packets (if any) are full-sized.
Signed-off-by: Doug Graham <dgraham@xxxxxxxxxx>
---
--- linux-2.6.29/net/sctp/output.c 2009/08/02 00:47:44 1.3
+++ linux-2.6.29/net/sctp/output.c 2009/08/02 00:51:18
@@ -717,7 +717,8 @@ static sctp_xmit_t sctp_packet_append_da
* unacknowledged.
*/
if (!sp->nodelay && sctp_packet_empty(packet) &&
- q->outstanding_bytes && sctp_state(asoc, ESTABLISHED)) {
+ (q->outstanding_bytes % asoc->frag_point) != 0 &&
+ sctp_state(asoc, ESTABLISHED)) {
unsigned len = datasize + q->out_qlen;
/* Check whether this chunk and all the rest of pending
Seem good! But it may be broken the small packet transmit which can be
used Nagle algorithm.
Such as this:
Endpoint A Endpint B
<------------- DATA (size=1452/2) delay send
<------------- DATA (size=1452/2) send immediately
<------------- DATA (size=1452/2) send immediately ** broken
<------------- DATA (size=1452/2) delay send
<------------- DATA (size=1452/2) send immediately
<------------- DATA (size=1452/2) send immediately ** broken
Can you try this one?
I would, except I don't understand what you're getting at. Does this
mean to send a total of
6 1454 byte messages from B to A? If so, why would the first one be
delayed?
Assuming that no SACKs are received by B, this should result in the
first 3 packets getting sent
immediately, a 1452 byte fragment, then a 2 byte fragment, then the
second 1452 byte fragment.
When it comes time to send the second 2 byte fragment, Nagle kicks in
and prevents if from
being sent until a SACK is received.
But I'm pretty sure I missed your point. Can you flesh it out a bit?
--Doug
--
To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html