Re: [Last-Call] [tcpm] Last Call: <draft-ietf-tcpm-rack-13.txt> (The RACK-TLP loss detection algorithm for TCP) to Proposed Standard

Hi,

please see inline.

On Mon, 7 Dec 2020, Yuchung Cheng wrote:

On Mon, Dec 7, 2020 at 8:06 AM Markku Kojo <kojo@xxxxxxxxxxxxxx> wrote:

Hi Yuchung,

thanks for your reply. My major point is that IMHO this RACK-TLP
specification should give the necessary advice w.r.t. congestion control
in cases where such advice is not available in the RFC series or is not
easy to interpret correctly from the existing RFCs.

Even though the CC principles are available in the RFC series, I believe you
agree with me that getting detailed CC actions correct is sometimes quite
hard, especially if one needs to interpret and apply them from a
specification other than the one being implemented. Please see more details inline below.

In addition, we need to remember that this document is not meant only for
TCP experts following the tcpm list who have a deep understanding of
congestion control, but also for those implementing TCP for the very first
time, for various appliances, for example. They do not have first-hand
experience in implementing TCP congestion control and deserve clear
advice on what to do.

On Fri, 4 Dec 2020, Yuchung Cheng wrote:

On Fri, Dec 4, 2020 at 5:02 AM Markku Kojo <kojo@xxxxxxxxxxxxxx> wrote:

Hi all,

I know this is a bit late but I didn't have time earlier to take a look at
this draft.

Given that this RFC-to-be is standards track and RECOMMENDED to replace
the current DupAck-based loss detection, it is important that the spec is
clear in its advice to those implementing it. The current text seems to
lack important advice w.r.t. congestion control, and even though
the spec tries to decouple loss detection from congestion control
and does not intend to modify existing standard congestion control,
some of the examples advise incorrect congestion control actions.
Therefore, I think it is worth correcting the mistakes and taking
yet another look at a few implications of this specification.
As you noted, the intention is to decouple the two as much as possible.

Unlike 20 years ago, when TCP loss detection and congestion
control were essentially glued together in one piece, the decoupling of the two
(including modularizing congestion control in implementations) has
helped fuel many great inventions in new congestion controls.
Codifying so-called default C.C. reactions in the loss detection is a
step backward that the authors try their best to avoid.

While I fully agree with the general principle of decoupling loss
detection from congestion control when it is possible without leaving
open questions, I find it hard to get congestion control right with this
spec, in the cases I raised, just by following the current standards-track
CC specifications. The reason for this is that RACK-TLP introduces
new ways to detect loss (i.e., not present in any earlier standards-track
RFC) and the current CC specifications do not provide correct CC actions
for such cases, as I try to point out below.

To keep the
document less "abstract / unclear", as many WGLC reviewers commented,
we use examples to illustrate, and those include CC actions. But the
details of these CC actions are likely to become obsolete as CC
hopefully continues to advance.

Agreed. But I would appreciate it if the CC actions in the examples
correctly followed what is specified in the current CC RFCs. And I
would suggest explicitly citing the RFC(s) that each of the examples is
illustrating, so that there is no doubt which CC variant the example
is valid with. Then there is no problem with the correctness of the
example either, even if the cited RFC is later obsoleted.


Sec. 3.4 (and elsewhere when discussing recovering a dropped
retransmission):

It is very useful that RACK-TLP allows for recovering dropped retransmissions.
However, it seems that the spec ignores the fact that loss of a
retransmission is a loss in a successive window, which requires reacting
to congestion twice as per RFC 5681. This advice must be included in
the specification because with RACK-TLP the recovery of a dropped
retransmission takes place during fast recovery, which is very different
from the other standard algorithms and therefore easy to miss
when implementing this spec.

per RFC5681 sec 4.3 https://tools.ietf.org/html/rfc5681#section-4.3
"Loss in two successive windows of data, or the loss of a
 retransmission, should be taken as two indications of congestion and,
  therefore, cwnd (and ssthresh) MUST be lowered twice in this case."

RACK-TLP is a loss detection algorithm. RFC5681 is crystal clear on
this so I am not sure what clause you suggest to add to RACK-TLP.

Right, this is the CC *principle* in RFC 5681 I am referring to, but I am
afraid it is not enough to help one correctly implement such lowering
of cwnd (and ssthresh) twice when a loss of a retransmission is detected
during Fast Recovery. Nor do the RFCs clearly advise *when* this reduction
must take place.

Congestion control principles tell us that congestion must be reacted to
immediately when detected. But at the same time, standards-track CC
specifications react to congestion only once during Fast Recovery,
because the losses in the current window, which Fast Recovery repairs,
occurred during the same RTT. That is, the current CC specifications do
not handle lost retransmissions during Fast Recovery; instead, the correct CC
reaction to a loss of a retransmission is automatically achieved by those
specifications via the RTO, when cwnd is explicitly reset upon RTO.

Furthermore, the problem I am trying to point out is that there is no
correct rule/formula available in the standards-track RFCs that would
give the correct way to reduce cwnd when the loss of a retransmission is
detected with RACK-TLP.

I suggest that everyone reading this message pause at this point
and figure out for themselves what they think would be the correct
equation in the standards-track RFC series to find the new halved value
for cwnd when RACK-TLP detects a loss of a retransmission during Fast
Recovery. I would appreciate a comment on the tcpm list from those who
think they found the correct answer immediately and easily, and what
the formula was.

...

I think the best advice one may find by looking at RFC 6675 (and RFC
5681) is to set

  ssthresh = cwnd = (FlightSize / 2) (RFC 6675, Sec 5, algorithm step 4.2)

Now, let's modify the example in Sec 3.4 of the draft:

1. Send P0, P1, P2, ..., P20
   [Assume P1, ..., P20 dropped by network]

2.   Receive P0, ACK P0
3a.  2RTTs after (2), TLP timer fires
3b.  TLP: retransmits P20
...
5a.  Receive SACK for P20
5b.  RACK: marks P1, ..., P20 lost
      set cwnd=ssthresh=FlightSize/2=10
5c.  Retransmit P1, P2 (and some more depending on how CC implemented)
      [P1 retransmission dropped by network]

6.   Receive SACK P2 & P3
7a.  RACK: marks P1 retransmission lost
      As per RFC 6675: set cwnd=ssthresh=FlightSize/2=20/2=10
7b.  Retransmit P1
      ...

So, is the new value of the cwnd (10MSS) correct and halved twice? If not,
where is the correct formula to do it?
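To make the arithmetic in the trace above concrete, here is a back-of-the-envelope sketch (a Python illustration of my own, with MSS normalized to 1 and the FlightSize values taken straight from the example):

```python
MSS = 1

# Step 5b: RACK marks P1..P20 lost; FlightSize at that point is
# 20 segments, per the example.
flight_size = 20 * MSS
cwnd_first = flight_size // 2    # RFC 6675 step 4.2: ssthresh = cwnd = FlightSize/2

# Step 7a: the retransmission of P1 is detected lost during Fast
# Recovery; per the example, FlightSize is still 20, so the same
# formula gives the same value -- no second halving happens.
cwnd_second = flight_size // 2

assert cwnd_first == 10
assert cwnd_second == 10         # should arguably have been halved again to 5
```

The point of the sketch is that applying the only available formula twice yields 10 MSS both times, i.e., there is no second reduction.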

Before RFC 5681 was published we had a discussion on FlightSize, and that
during Fast Recovery it does not reflect the amount of segments in
flight correctly but may have too large a value. It was decided not
to try to correct it because it only has an impact when the RTO fires during
Fast Recovery, and in such a case cwnd is reset to 1 MSS. Having too large
an ssthresh for RTO recovery in some cases was not considered that bad,
because a TCP sender anyway takes the most conservative CC action with
cwnd and would slow start from cwnd = 1 MSS. But now that RACK-TLP enables
detecting loss of a retransmission during Fast Recovery, we have an unresolved
problem.


To account for your points, IMO clearly stating the existing CC RFC
interactions w/o mandating any C.C. actions is the best way to move
forward. Here is the text diff I proposed:

"Figure 1, above, illustrates  ...
Notice a subtle interaction with existing congestion control actions
on event 7a. It essentially starts another new episode of congestion
due to the detection of lost retransmission. Per RFC5681 (section 4.3)
that loss in two successive windows of data, or the loss of a
retransmission, should be taken as two indications of congestion as a
principle. But RFC6675 that introduces the pipe concept does not
specify such a second window reduction. This document reminds RACK-TLP

This is not quite a correct characterization of RFC 6675.
RFC 6675 does not repeat all the guidelines of RFC 5681. RFC 6675 DOES
specify the two indications of congestion implicitly, by clearly stating
in the intro that it follows the guidelines set in RFC 5681. And RFC 5681
articulates a set of MUSTs which ALL alternative loss recovery algorithms
MUST follow in order to become RFCs.

(*) RFC 5681, Sec 4.3 (pp. 12-13):

 "While this document does not standardize any of the
  specific algorithms that may improve fast retransmit/fast recovery,
  ...
  That is, ...
  Loss in two successive windows of data, or the loss of a
  retransmission, should be taken as two indications of congestion and,
  therefore, cwnd (and ssthresh) MUST be lowered twice in this case."


implementation to carefully consider the new multiple congestion
episode cases in the corresponding congestion control."

I am sorry to say that this is very fuzzy and leaves an implementer all alone in figuring out what to do.

More importantly, AFAIK we have had no discussion, nor do we have consensus, in the IETF on what the correct CC action is when a lost retransmission is detected. Should one reset cwnd like current CCs do, or would "halving again" be fine, or something else? IMHO this requires experimentation to decide.

I assumed there was an implementation of RACK-TLP with corresponding CC actions, and experimental results showing the implications of the selected CC action(s) at various levels of congestion. But it seems we have neither such an implementation nor experimental evidence? Am I wrong?

I sincerely apologize that this got raised this late in the process, and I know how irritating it may be. I like the idea of RACK-TLP and by no means is it my intention to hold up the process, but the lack of evidence makes me quite concerned. In particular when this document says that the current loss detection SHOULD be replaced in all implementations with RACK-TLP.

and to emphasize in section 9.3 Interaction with congestion control

"9.3.  Interaction with congestion control

RACK-TLP intentionally decouples loss detection ... this appropriate.
As mentioned in Figure 1 caption, RFC5681 mandates a principle that
Loss in two successive windows of data, or the loss of a
retransmission, should be taken as two indications of congestion, and
therefore reacted separately. However implementation of RFC6675 pipe
algorithm may not directly account for this newly detected congestion
events properly. Therefore the documents reminds RACK-TLP
implementation to carefully consider these implications in its
corresponding congestion control.

..."





Sec 9.3:

In Section 9.3 it is stated that the only modification to the existing
congestion control algorithms is that one outstanding loss probe
can be sent even if the congestion window is fully used. This is
fine, but the spec lacks the advice that if a new data segment is sent,
this extra segment MUST NOT be included when calculating the new value
of ssthresh as per equation (4) of RFC 5681. Such a segment is an
extra segment not allowed by cwnd, so it must be excluded from
FlightSize if the TLP probe detects loss, or if there is no ACK
and an RTO is needed to trigger loss recovery.

Why exclude TLP (or any data) from FlightSize? The congestion control
needs precise accounting of the flight size to react to congestion
properly.

Because FlightSize does not always reflect the correct amount of data
allowed by cwnd. When a TCP sender is not already in loss recovery and
it detects loss, this loss indicates the congestion point for the TCP
sender, i.e., how much data it can have outstanding. It is this amount of
data that it must use in calculating the new value of cwnd (and
ssthresh), so it must not include any data sent beyond the congestion
point. When TLP sends a new data segment, that segment is beyond the
congestion point and must not be included. The same holds for the segments
sent via Limited Transmit: they are allowed to be sent out by the packet
conservation rule (a DupAck indicates a packet has left the network, but does
not allow increasing cwnd), i.e., the actual amount of data in flight
remains the same.

In these cases the temporary over-commit is not accounted for, as a DupAck
does not decrease FlightSize, and in case of an RTO the next ACK comes too
late. This is similar to the rule in RFC 5681 and RFC 6675 that prohibits
including the segments transmitted via Limited Transmit in the
calculation of ssthresh.
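As an illustration of the exclusion argued above (a sketch under my own assumptions, not text from the draft or the RFCs), equation (4) of RFC 5681, ssthresh = max(FlightSize/2, 2*SMSS), applied after removing the TLP probe that carried new data:

```python
SMSS = 1460  # assumed sender MSS in bytes

def ssthresh_on_loss(flight_size_bytes, tlp_probe_bytes):
    # The TLP probe carrying *new* data was sent beyond what cwnd
    # allowed, so (per the argument above) it is excluded before
    # applying equation (4) of RFC 5681:
    #   ssthresh = max(FlightSize / 2, 2*SMSS)
    effective_flight = flight_size_bytes - tlp_probe_bytes
    return max(effective_flight // 2, 2 * SMSS)

# 10 full-sized segments in flight plus one TLP probe of new data:
assert ssthresh_on_loss(11 * SMSS, SMSS) == 5 * SMSS
```

The function name and structure are hypothetical; the point is only that the probe segment is subtracted from FlightSize before the halving.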

In Section 9.3 a few example scenarios are used to illustrate the
intended operation of RACK-TLP.

  In the first example a sender has a congestion window (cwnd) of 20
  segments on a SACK-enabled connection.  It sends 10 data segments
  and all of them are lost.

The text claims that without RACK-TLP the ending cwnd would be 4 segments
due to congestion window validation. This is incorrect.
As per RFC 7661, the sender MUST exit the non-validated phase upon an
RTO. Therefore the ending cwnd would be 5 segments (or 5 1/2 segments if
the TCP sender uses equation (4) of RFC 5681).

The operation with RACK-TLP would inevitably result in congestion
collapse if RACK-TLP behaved as described in the example, because
it restores the previous cwnd of 10 segments after the fast recovery
and would not react to congestion at all! I think this is not the
behavior intended by this spec but a mistake in the example.
The ssthresh calculated at the beginning of loss recovery should
be 5 segments as per RFC 6675 (and RFC 5681).
To clarify, would this text look more clear?

'an ending cwnd set to the slow start threshold of 5 segments (half of
the original congestion window of 10 segments)'

This is correct, but please replace:

  (half of the original congestion window of 10 segments)
-->
  (half of the original FlightSize of 10 segments)

sure will do


cwnd in the example was 20 segments.

Please also correct the ending cwnd for the "without RACK" scenario.
I pointed out the wrong equation number in RFC 5681 and an incorrect cwnd
value, my apologies. I meant equation (3), and that results in an ending
cwnd of 5 and 2/5 MSS (not 5 and 1/2 MSS).
NB: and if a TCP sender implements entering CA when cwnd > ssthresh, then
the ending cwnd would be 6 and 1/6 MSS.


Furthermore, it seems that this example with RACK-TLP refers to using
PRR_SSRB, which effectively implements regular slow start in this
case(?). From a congestion control point of view this is correct, because
the entire flight of data as well as the ACK clock was lost.

However, as correctly discussed in Sec 2, the congestion window must be reset
to 1 MSS when an entire flight of data and the ACK clock are lost. But how
can an implementer know what to do if she/he is not implementing the
experimental PRR algorithm? This spec articulates specifying an
alternative for DupAck counting, indicating that TLP is used to trigger
Fast Retransmit & Fast Recovery only, not a loss recovery in slow start.
This means that without additional advice an implementation of this
spec would just halve cwnd and ssthresh and send a potentially very
large burst of segments at the beginning of the Fast Recovery, because
there is no ACK clock. So, this spec begs for advice (a MUST) on when to
slow start and reset cwnd and when not to, or at least a discussion of
this problem and some sort of advice on what to do and what to avoid.
And maybe a recommendation to implement it with PRR?

It's wise to decouple loss detection (RACK-TLP) from congestion/burst
control (when to slow-start). The use of PRR is just an example to
illustrate, not meant as a recommendation.

I understand the use of PRR was just an example, but my point is that if
one wants to implement RACK-TLP but intends to implement RFC 6675 rather
than PRR, then we do not have a rule in RFC 6675 to correctly implement
CC for the case when an entire flight is lost and the loss is detected with
TLP. The congestion control principle for this is clear and also stated in
this draft, but IMHO that is not enough to ensure correct implementation.

To my understanding we only have implementation experience for RACK-TLP
together with PRR, which has the necessary rule to handle this kind
of scenario correctly.

So, my question is: how can one implement CC correctly without PRR in such
a scenario where the entire inflight is lost?
Which rule, and where in the RFC series, gives the necessary guidance to
reset cwnd and slow start when TCP detects loss of an entire flight?

I think we're going in loops. To move forward it'd help if you suggest
some text you like to change.

Unfortunately I do not have an immediate solution. I assumed there was an implementation, and that you could explain what the solution there was, with experimental results showing how it works. If I have understood correctly, there is a solution/implementation and experience of RACK-TLP working together with PRR, but no solution/implementation nor experience of how it works without PRR. Am I correct?

Given my understanding of PRR, the problem of an entire flight being dropped is quite nicely solved with it. However, implementing RACK-TLP without PRR begs for a solution. Here is my current understanding:

(1) The RACK part is what can be said to "replace" DupAck counting.
(2) The TLP part is effectively a new retransmission timeout to detect
    a loss of an entire flight of data (i.e., it is only invoked
    when a tail of the current flight becomes lost, which equals
    the case of losing an entire flight, since all segments before
    the tail loss will get cumulatively ACKed and hence these segments
    are no longer a part of the current flight at the time the loss
    is detected. And we can assume an application-limited TCP sender.)
(3) Current CC principles require resetting cwnd in such a case
    and entering slow start (and so effectively does PRR, though it is
    not explicitly stated in RFC 6937). Slow start avoids a big burst
    if the lost flight is big.
(4) RACK-TLP would possibly like to allow cwnd to be set to half
    of the lost flight and not to slow start. This means that the bigger
    the lost flight is, the bigger the burst that gets transmitted at
    the beginning of the recovery, which is bad. So, this approach would
    need at least a rule/advice for burst avoidance (slow start, pacing,
    ...).
(5) When the lost flight is small (<= 4 segments) there is no difference
    in recovery efficiency between (3) and (4). If the lost flight is
    > 5 segments, then (4) takes fewer RTTs to complete the recovery
    but generates a burst. Note, even if pacing over one SRTT were
    used, it would still be a burst.

Now, it would be useful to have experimental data on how the size of the lost flight is distributed. Is it typically just a few segments, as illustrated in the examples of the draft, or is it often larger?

My advice would be to add a rule that cwnd MUST be reset to 1 MSS and the sender MUST enter slow start if TLP detects loss of an entire flight. This would be safe. Otherwise, without experimental evidence from a wide range of different network conditions and workloads, it feels unsafe to allow a more aggressive approach.
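A minimal sketch of the conservative rule proposed above (the function name and state layout are my own, hypothetical; this is not draft text):

```python
MSS = 1

def on_whole_flight_loss_detected_by_tlp(flight_size, cwnd):
    # Proposed conservative rule (an assumption, not from the draft):
    # when TLP detects that the entire flight was lost, react as after
    # an RTO -- halve ssthresh, reset cwnd to 1 MSS, and slow start,
    # since there is no ACK clock left to pace out a burst.
    ssthresh = max(flight_size // 2, 2 * MSS)
    cwnd = 1 * MSS
    return cwnd, ssthresh

cwnd, ssthresh = on_whole_flight_loss_detected_by_tlp(20, 20)
assert cwnd == 1 and ssthresh == 10
```

Slow starting from 1 MSS trades a few extra RTTs of recovery for burst avoidance, which matches point (3) above.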


Section 3 elaborates at length that the key point of RACK-TLP
is to maximize the chance of fast recovery. How C.C. governs the
transmission dynamics after losses are detected is out of scope of
this document, in the authors' opinion.



Another question relates to the use of TLP and adjusting timer(s) upon
timeout. In the same example discussed above, it is clear that the PTO
that fires TLP is just a more aggressive retransmit timer with
an alternative data segment to (re)transmit.

Therefore, as per RFC 2914 (BCP 41), Sec 9.1, when the PTO expires, it is in
effect a retransmission timeout and the timer(s) must be backed off.
This is not advised in this specification. Whether it is the TCP RTO
or the PTO that should be backed off is an open question. Otherwise,
if the congestion is persistent and further transmissions are also lost,
RACK-TLP would not react to congestion properly but would keep
retransmitting with a "constant" timer value, because a new RTT estimate
cannot be obtained.
On a bufferbloated and heavily congested bottleneck this would easily
result in sending at least one unnecessary retransmission per
delivered segment, which is not advisable (e.g., when there are a huge
number of applications sharing a constrained bottleneck and these
applications are sending only one (or a few) segments and then
waiting for a reply from the peer before sending another request).

Thanks for pointing to the RFC.  After TLP, RTO timers will
exp-backoff (as usual) for stability reasons mentioned in sec 9.3
(didn't find 9.1 relevant).

My apologies for referring to the wrong section of RFC 2914; yes, I meant
Sec 9.3.

In your scenario, you presuppose the
retransmission is unnecessary, so obviously TLP is not good. Consider
what happens without TLP, where all the senders fire RTO spuriously and
blow up the network. It is equally unfortunate behavior. "bdp
insufficient for many flows" is a congestion control problem

If (without TLP) the RTO is spurious, it may result in unnecessary
retransmissions. But we have F-RTO (RFC 5682) and Eifel (RFC 3522) to
detect and resolve it without TLP, so I don't see that as a problem.

To clarify more what I am concerned about: think about a scenario where a
(narrow) bottleneck becomes heavily congested by a huge number of
competing senders, such that the available capacity per sender is less
than 1 segment (or << 1 MSS).
This is the situation a network first enters before congestion collapse
is realized. So, it is extremely important that all CC and timer
mechanisms handle it properly. Regular TCP handles it via RFC 6298 by
backing off the RTO exponentially and keeping this backed-off RTO until a
new ACK is received for new data. This saves RACK-TLP from full congestion
collapse. But consider what happens: even though the RTO is backed off, each
time a TCP sender manages to get one segment through (with cwnd = 1 MSS)
it always first arms the PTO with a more or less constant value of 2*SRTT. If
the bottleneck is bufferbloated, the actual RTT easily exceeds 2*SRTT and
the TLP becomes spurious. After a spurious TLP, the RTO expires (maybe more than
once before the exponential back-off of the RTO results in a large enough value)
and a new RTT sample is not obtained. So, SRTT remains unchanged, and even
if a new sample is sometimes obtained, SRTT gets adjusted very slowly. As
a result, each TCP sender would keep on sending a spurious TLP for each
new segment, resulting in at least 50% of the packets being unnecessarily
retransmitted and the utilization of the bottleneck being < 50%. This would not
be a full congestion collapse, but it has unwanted symptoms tending toward
congestion collapse (note: there is no clear line for what level of
reduction in the delivery of useful data is considered congestion
collapse).

AFAIK you are saying: under extreme congestion shared by many short
flows, RACK-TLP can cause more packet losses because of the more
aggressive PTO timer. I agree and can add this to the "section 9.3".

No, that was not what I meant. This happens simply when enough flows are competing on the same bottleneck. It may be long flows, or applications with more or less continuous request-reply exchanges. E.g., if the bottleneck bit rate is 1 Mbit/s, the RTT is 500 msecs and the PMTU is 1500 B, then a bit over 40 simultaneous flows would mean that the bottleneck becomes fully utilized with an equal share of roughly 1 MSS per flow. With a quite typical bottleneck buffer size roughly equal to the BDP, about 80+ flows would fill up the buffer as well and increase the RTT to >= 1 sec. A bufferbloated bottleneck buffer would mean an even larger RTT and allow more flows to share the bottleneck without loss.

If the number of TCP flows is > 90 (or >> 90), RACK-TLP would end up (almost) always unnecessarily retransmitting each new segment once (the PTO being 1 sec or around 1 sec). RTO back-off after a few rounds saves each TCP sender from additional unnecessary retransmits, but still ~50% of the delivered packets are not making any useful progress.

If the bottleneck buffer is small, this would result in more losses, as you suggest, but it would not be that much of a problem, because the majority of the unnecessary retransmits would not get delivered over the bottleneck link. Instead, they would get dropped at the bottleneck. The unnecessary retransmits would just create extra load on the path before the bottleneck (which of course is not a non-problem either).
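The flow-count arithmetic above can be checked with a short calculation (the link rate, RTT, and PMTU values are the ones assumed in the example, not measurements):

```python
# Rough check of the shared-bottleneck numbers above (assumed values).
link_bps = 1_000_000      # 1 Mbit/s bottleneck
rtt_s    = 0.5            # 500 ms base RTT
pmtu_B   = 1500           # path MTU in bytes

# Bandwidth-delay product in full-sized packets: ~41.7, so a bit
# over 40 flows at 1 MSS each fully utilize the bottleneck.
bdp_packets = link_bps * rtt_s / (pmtu_B * 8)
flows_to_fill_pipe = int(bdp_packets)

# With a buffer roughly equal to the BDP, about twice as many flows
# also fill the queue, doubling the RTT to >= 1 second.
flows_to_fill_buffer = 2 * flows_to_fill_pipe
queued_rtt_s = rtt_s * 2

assert 40 < bdp_packets < 42
assert flows_to_fill_buffer >= 80
```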

What the authors disagree with is that the RTO must be backed off at the first
instance if the TLP is not ACKed. While your suggestion helps the
congestion case, it may also hurt the recovery in other cases, when the
TLP is dropped due to light/bursty/transient congestion. Arguing subjectively
about which scenarios matter more is not productive.

When RACK-TLP is required to be implemented by all TCP stacks, it is extremely important that it always works in a safe way, and congestion is always the primary concern to get properly handled. Without arguing more, I just want to point out that TCP must work reasonably for all Internet users. That means it must not create a situation where even a small minority of Internet users often, or almost always, encounter severe problems with their connectivity.

For example, users in developing countries, where possibly an entire village shares just one mobile (cellular, maybe 3G, possibly only GPRS) connection for their Internet access and pays per amount of data, should also get reasonable TCP behavior. In such a case, I cannot agree with engineering that results in almost always getting only half of the already scarce bandwidth while paying double the price for the useful data.

But thinking about a way forward: Karn's algorithm would require backing off the PTO even if you get an ACK for the TLP. Relaxing this does not sound that bad to me at all, because there is often a pause before the TLP, and the TCP sender gets feedback, so apparently conditions are not that bad and loss recovery gets triggered (hopefully in slow start).

If the TLP is not ACKed, an RTO is needed and recovery is completed using RTO recovery in slow start. Now, if the RTO recovery is successful (no losses during RTO recovery), it should be quite likely that the TCP sender is also able to successfully send one new segment, because it enters CA when it is sending at half of the previous rate. So, once you get an ACK for the new segment, the TLP back-off can be removed, and it is unlikely that the TLP back-off slowed down the next loss detection.

On the other hand, if the RTO recovery after an unsuccessful TLP is not successful (more losses are detected), it is quite likely that the congestion has not been resolved. So, it is important to be conservative and have a backed-off PTO (or even turn it off) to avoid (further) unnecessary retransmits. If the PTO is not backed off, I'd envision the PTO mainly failing to get an ACK in such a case, and thereby not being that useful.

This certainly would require experimental data from a heavily congested setting to really figure out the actual impact of the different alternatives.

So the question we need to look at is whether RACK-TLP's 2RTT PTO + regular
RTO backoff is going to cause major stability issues under extreme
congestion. My understanding, based on my officemate Van Jacobson's
explanation, is that as long as there's eventual exponential backoff, we'll
avoid repeatedly shelling the network.

Right. But the problem is that with TLP as specified we do not have full exponential back-off. It lacks Karn's clamped retransmit back-off (as it is called in Van Jacobson's seminal paper), which requires keeping the backed-off timer of a retransmitted segment for the next (new) segment, and which ensures that there is eventual exponential back-off. Backing off just the RTO is not enough, because the "fixed" PTO for each new segment breaks this back-off chain; that is, the exponential back-off is not continued until there is evidence that congestion has been resolved (a cumulative ACK arrives for new data). But as I said, backing off the RTO saves RACK-TLP from a full congestion collapse. Still, wasting 50% of the available network capacity in certain usage scenarios does not sound acceptable to me.
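A hypothetical sketch (my own framing, not draft text) of the broken back-off chain described above: with Karn's clamp the backed-off timer carries over to the next segment until new data is cumulatively ACKed, whereas a fixed ~2*SRTT PTO re-armed per segment discards the accumulated multiplier.

```python
SRTT = 0.5  # seconds (assumed smoothed RTT)

def clamped_timer(backoff_mult, new_data_acked):
    # Karn's clamped retransmit back-off: the multiplier persists
    # across segments until congestion is known to have cleared
    # (a cumulative ACK for new data arrives).
    if new_data_acked:
        backoff_mult = 1
    return 2 * SRTT * backoff_mult

def rack_tlp_pto(backoff_mult, new_data_acked):
    # As specified: each new segment first arms a PTO of ~2*SRTT,
    # independent of any earlier RTO back-off -- the chain is broken.
    return 2 * SRTT

# After two RTO back-offs (multiplier 4), with no new data ACKed yet:
assert clamped_timer(4, new_data_acked=False) == 4.0
assert rack_tlp_pto(4, new_data_acked=False) == 1.0
```

Under persistent congestion with an actual RTT above 1 second, the 1-second PTO in this sketch keeps firing spuriously, while the clamped 4-second timer would not.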

As a matter of fact, some
major TCP (!= Linux) implementations have implemented linear backoff for the
first N RTOs before exp-backoff.

But that's quite ok as long as there is eventual exponential backoff, including Karn's clamped retransmit backoff. Linear back-off in the beginning just makes resolving (heavy) congestion a bit slower.



Additional notes:

Sec 2.2:

Example 2:
"Lost retransmissions cause a  resort to RTO recovery, since
  DUPACK-counting does not detect the loss of the retransmissions.
  Then the slow start after RTO recovery could cause burst losses
  again that severely degrades performance [POLICER16]."

RTO reovery is done in slow start. The last sentence is confusing as
there is no (new) slow-start after RTO recovery (or more precisely
slow start continues until cwnd > ssthresh). Do you mean: if/when slow
start still continues after RTO Recovery has repaired lost segments,
it may cause burst losses again?
I mean the slow start after (the start of) RTO recovery. HTH

Tnx. I'd appreciate it if the text could be clarified to reflect this more
accurately. Maybe something along these lines(?):

  "Then the RTO recovery in slow start could cause burst
  losses again that severely degrades performance [POLICER16]."


Example 3:
  "If the reordering degree is beyond DupThresh, the DUPACK-
   counting can cause a spurious fast recovery and unnecessary
   congestion window reduction.  To mitigate the issue, [RFC4653]
   adjusts DupThresh to half of the inflight size to tolerate the
   higher degree of reordering.  However if more than half of the
   inflight is lost, then the sender has to resort to RTO recovery."

This seems to be a somewhat incorrect description of TCP-NCR, specified in
RFC 4653. TCP-NCR uses Extended Limited Transmit, which keeps sending
new data segments on DupAcks, making it likely to avoid an RTO in
the given example scenario, provided not too many of the new data
segments triggered by Extended Limited Transmit are lost.
sorry I don't see how the text is wrong describing RFC4653,
specifically the algorithm in adjusting ssthresh

To my understanding, RFC4653 initializes DupThresh to half of the inflight
size at the beginning of Extended Limited Transmit. Then, on each
DupAck, it adjusts (recalculates) DupThresh again, such that ideally a cwnd's
worth of DupAcks is received before packet loss is declared (or
reordering detected). So, if I am not mistaken, loss of half of the
inflight does not necessarily result in RTO recovery with TCP-NCR.
Could you suggest the text you'd like on NCR description.

I'm not an expert on, nor closely acquainted with, NCR. There might be many different packet loss patterns that affect the behavior. So, my advice is to simply drop the last sentence, starting with "However ...", because it seems incorrect, and replace the second-to-last sentence:

  To mitigate the issue, [RFC4653]
  adjusts DupThresh to half of the inflight size to tolerate the
  higher degree of reordering.

-->

  To mitigate the issue, TCP-NCR [RFC4653]
  increases the DupThresh from the current fixed value of three duplicate
  ACKs [RFC5681] to approximately a congestion window of data having left
  the network.



Sec. 3.5:

  "For example, consider a simple case where one
  segment was sent with an RTO of 1 second, and then the application
  writes more data, causing a second and third segment to be sent right
  before the RTO of the first segment expires.  Suppose only the first
  segment is lost.  Without RACK, upon RTO expiration the sender marks
  all three segments as lost and retransmits the first segment.  When
  the sender receives the ACK that selectively acknowledges the second
  segment, the sender spuriously retransmits the third segment."

This seems incorrect. When the sender receives the ACK that selectively
acknowledges the second segment, it is a DupAck as per RFC 6675 and does
not increase cwnd; cwnd remains 1 MSS and pipe is 1 MSS. So, the
retransmission of the third segment is not allowed until the cumulative ACK
of the first segment arrives.
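Spelled out (an illustrative Python sketch under my reading, counting in
whole segments; the helper name is my own):

```python
def may_transmit(pipe, cwnd):
    # RFC 6675, Sec. 5: the sender may transmit or retransmit only
    # while pipe < cwnd.
    return pipe < cwnd

# After the RTO: cwnd = 1 segment (RFC 5681) and the retransmission
# of the first segment keeps pipe = 1.  The ACK that SACKs the second
# segment is a DupAck, so cwnd stays at 1 and the third segment must
# wait for the cumulative ACK of the first segment.
```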
I don't see where RFC6675 forbids growing cwnd. Even if it does, I
don't think it's a good thing (in RTO-slow-start) as DUPACK clearly
indicates a delivery has been made.

SACKed sequences in DupAcks indicate that those sequences were
delivered, but they do not tell when they were sent. The basic principle
of slow start is to reliably determine the available network capacity
during slow start. Therefore, slow start must ensure it uses only
segments sent during the slow start to increase cwnd. Otherwise, a TCP
sender may encounter exactly the problem of unnecessary retransmission
envisioned in this example of the RACK-TLP draft (and increase cwnd on
invalid ACKs).

RFC 6675 does re-specify DupAck with the SACK option, but it does not
include the rules for slow start. Slow start is specified in RFC 5681,
which is crystal clear in allowing an increase in cwnd only on cumulative
ACKs, i.e., forbidding an increase in cwnd on DupAcks (RFC 5681, Sec. 3,
page 6):

   During slow start, a TCP increments cwnd by at most SMSS bytes for
   each ACK received that cumulatively acknowledges new data.
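To illustrate just this rule (my own sketch in Python with assumed variable
names; not a full TCP implementation):

```python
SMSS = 1460  # sender maximum segment size in bytes (assumed value)

def slow_start_on_ack(cwnd, snd_una, ack_seq):
    # RFC 5681: during slow start, cwnd grows by at most SMSS per ACK
    # that cumulatively acknowledges new data.  A DupAck
    # (ack_seq == snd_una), even one carrying new SACK blocks, does
    # not advance snd_una and therefore must not grow cwnd.
    if ack_seq > snd_una:
        cwnd += SMSS       # cumulative ACK of new data
        snd_una = ack_seq
    return cwnd, snd_una
```

In the Sec. 3.5 example this is exactly why the ACK that SACKs the second
segment leaves cwnd at 1 MSS.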

Maybe this example in the RACK-TLP draft was inspired by an incorrect
implementation of SACK-based loss recovery?

FYI: when we were finalizing RFC 6675 I suggested including also an
algorithm for RTO recovery with SACK in RFC 6675. The reason was exactly
that it might not be easy to gather info from multiple documents, and
hence it would help the implementor to have all necessary advice in a
single document. This unfortunately did not get realized, though.

I am honestly lost quibbling over these RFC 6675 implementation details
and feel pedantic about them; these standards largely serve as guiding
principles rather than line-by-line code. Does cwnd being one packet
bigger or smaller in this example make any difference in advancing the
Internet's capability to do better loss detection? I do not think so.

Sorry, I don't understand what was quibbling here. I believe I did not argue anything about cwnd size in the example at hand in Sec. 3.5. My point is that it describes incorrect slow start behavior. A correctly implemented slow start does not have the problem illustrated in the example.

At this point please suggest a text you like to change.

That is simple. Please remove the description of the behavior without RACK. Or correct it: the behavior would be exactly the same as with RACK, but the reason for not unnecessarily retransmitting the third segment can be described as being different. And please correct also the description with RACK: with RACK, the third segment does not get unnecessarily retransmitted for the reason already indicated, and also because cwnd = 1 MSS. And no new segments are allowed either, with cwnd = 1 MSS.

Please remove also the last paragraph of Sec. 3.5, as it is also an incorrect description of the behavior.

BR,

/Markku




Best regards,

/Markku



On Mon, 16 Nov 2020, The IESG wrote:


The IESG has received a request from the TCP Maintenance and Minor Extensions
WG (tcpm) to consider the following document: - 'The RACK-TLP loss detection
algorithm for TCP'
 <draft-ietf-tcpm-rack-13.txt> as Proposed Standard

The IESG plans to make a decision in the next few weeks, and solicits final
comments on this action. Please send substantive comments to the
last-call@xxxxxxxx mailing lists by 2020-11-30. Exceptionally, comments may
be sent to iesg@xxxxxxxx instead. In either case, please retain the beginning
of the Subject line to allow automated sorting.

Abstract


  This document presents the RACK-TLP loss detection algorithm for TCP.
  RACK-TLP uses per-segment transmit timestamps and selective
  acknowledgements (SACK) and has two parts: RACK ("Recent
  ACKnowledgment") starts fast recovery quickly using time-based
  inferences derived from ACK feedback.  TLP ("Tail Loss Probe")
  leverages RACK and sends a probe packet to trigger ACK feedback to
  avoid retransmission timeout (RTO) events.  Compared to the widely
  used DUPACK threshold approach, RACK-TLP detects losses more
  efficiently when there are application-limited flights of data, lost
  retransmissions, or data packet reordering events.  It is intended to
  be an alternative to the DUPACK threshold approach.




The file can be obtained via
https://datatracker.ietf.org/doc/draft-ietf-tcpm-rack/



No IPR declarations have been submitted directly on this I-D.





_______________________________________________
tcpm mailing list
tcpm@xxxxxxxx
https://www.ietf.org/mailman/listinfo/tcpm




--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call


