Re: [Last-Call] [tcpm] Last Call: <draft-ietf-tcpm-rack-13.txt> (The RACK-TLP loss detection algorithm for TCP) to Proposed Standard

Markku Kojo <kojo=40cs.helsinki.fi@xxxxxxxxxxxxxx> · Thu, 17 Dec 2020 10:46:53 +0200 (EET)

Hi,

On Wed, 16 Dec 2020, Martin Duke wrote:

I spent a little longer looking at the specs more carefully, and I explained (1)
incorrectly in my last two messages. P21..29 are not Limited Transmit packets. 

Correct. That is just the normal rule that allows sending new data during fast
recovery.

However, unless I'm missing something else, 6675 is clear that the recovery period
does not end until the cumulative ack advances, meaning that detecting the lost
retransmission of P1 does not trigger another MD directly.

As I have said earlier, RFC 6675 does not repeat all congestion control 
principles from RFC 5681. It definitely honors the CC principle that
requires treating a loss of a retransmission as a new congestion
indication and another MD. I believe I am obligated to know this as a 
co-author of RFC 6675. ;)

RFC 6675 explicitly indicates that it follows RFC 5681 by stating in the 
abstract:

" ... conforms to the spirit of the current congestion control
 specification (RFC 5681 ..."

And in the intro:

  "The algorithm specified in this document is a straightforward
   SACK-based loss recovery strategy that follows the  guidelines
   set in [RFC5681] ..."

I don't think there is anything unclear in this.

RFC 6675 and all other standard congestion controls (RFC 5681 and RFC
6582) handle a loss of a retransmission by "enforcing" RTO to detect it.
And RTO guarantees MD. RACK-TLP changes the loss detection in this case
and therefore the standard congestion control algorithms do not have
actions to handle it correctly. That is the point.

BR,

/Markku

Thanks for this exercise! It's refreshed my memory of these details after working
on slightly different QUIC algorithms for a long time.

On Wed, Dec 16, 2020, 18:55 Martin Duke <martin.h.duke@xxxxxxxxx> wrote:
(1) Flightsize: in RFC 6675. Section 5, Step 4.2:

       (4.2) ssthresh = cwnd = (FlightSize / 2)

             The congestion window (cwnd) and slow start threshold
             (ssthresh) are reduced to half of FlightSize per [RFC5681].
             Additionally, note that [RFC5681] requires that any
             segments sent as part of the Limited Transmit mechanism not
             be counted in FlightSize for the purpose of the above
             equation.

IIUC the segments P21..P29 in your example were sent because of Limited
Transmit, and so don't count. The flightsize for the purposes of (4.2) is
therefore 20 after both losses, and the cwnd does not go up on the second
loss.

(2)
" Even a single shot burst every time there is significant loss
event is not acceptable, not to mention continuous aggressiveness, and
this is exactly what RFC 2914 and RFC 5033 explicitly address and warn
about."

"Significant loss event" is the key phrase here. The intent of TLP/PTO is to
equalize the treatment of a small packet loss whether it happened in the
middle of a burst or the end. Why should an isolated loss be treated
differently based on its position in the burst? This is just a logical
extension of fast retransmit, which also modified the RTO paradigm. The
working group consensus is that this is a feature, not a bug; you're welcome
to feel otherwise but I suspect you're in the rough here.

Regards
Martin

On Wed, Dec 16, 2020 at 4:11 PM Markku Kojo <kojo@xxxxxxxxxxxxxx> wrote:
      Hi Martin,

      See inline.

      On Wed, 16 Dec 2020, Martin Duke wrote:

      > Hi Markku,
      >
      > There is a ton here, but I'll try to address the top points.
      Hopefully
      > they obviate the rest.

      Sorry for being verbose. I tried to be clear but you actually
      removed my
      key issues/questions ;)

      > 1.
      > [Markku]
      > "Hmm, not sure what you mean by "this is a new loss detection
      after
      > acknowledgment of new data"?
      > But anyway, RFC 5681 gives the general principle to reduce
      cwnd and
      > ssthresh twice if a retransmission is lost but IMHO (and I
      believe many
      > who have designed new loss recovery and CC algorithms or
      implemented
      > them
      > agree) that it is hard to get things right if only congestion
      control
      > principles are available and no algorithm."
      >
      > [Martin]
      > So 6675 Sec 5 is quite explicit that there is only one cwnd
      reduction
      > per fast recovery episode, which ends once new data has been
      > acknowledged.

      To be more precise: fast recovery ends when the current window
      becomes
      cumulatively acknowledged, that is,

      (4.1) RecoveryPoint (= HighData at the beginning) becomes
      acknowledged

      I believe we agree and you meant this although new data below
      RecoveryPoint may become cumulatively acknowledged already
      earlier
      during the fast recovery. Reno loss recovery in RFC 5681 ends,
      when
      (any) new data has been acknowledged.

      > By definition, if a retransmission is lost it is because
      > newer data has been acknowledged, so it's a new recovery
      episode.

      Not sure where you have this definition? Newer than what are you
      referring to?

      But, yes, if a retransmission is lost with the RFC 6675 algorithm,
      it requires RTO to be detected and definitely starts a new recovery
      episode. That is, a new recovery episode is enforced by step (1.a) of
      NextSeg () which prevents retransmission of a segment that has already
      been retransmitted. If RACK-TLP is used for detecting loss with RFC 6675
      things get different in many ways, because it may detect loss of a
      retransmission. It would pretty much require an entire redesign
      of the algorithm. For example, calculation of pipe does not consider
      segments that have been retransmitted more than once.

      > Meanwhile, during the Fast Recovery period the incoming acks
      implicitly
      > remove data from the network and therefore keep flightsize
      low.

      Incorrect. FlightSize != pipe. Only cumulative acks remove data
      from
      FlightSize and new data transmitted during fast recovery inflate
      FlightSize. How FlightSize evolves depends on loss pattern as I
      said.
      It is also possible that FlightSize is low, it may err in both
      directions. A simple example can be used as a proof for the case
      where
      cwnd increases if a loss of retransmission is detected and
      repaired:

      RFC 6675 recovery with RACK-TLP loss detection:
      (contains some inaccuracies because it has not been defined how
      lost rexmits are calculated into pipe)

      cwnd=20; packets P1,...,P20 in flight = current window of data
      [P1 dropped and rexmit of P1 will also be dropped]

      DupAck w/SACK for P2 arrives
      [loss of P1 detected after one RTT from original xmit of P1]
      [cwnd=ssthresh=10]
      P1 is rexmitted (and it logically starts next window of data)

      DupAcks w/ SACK for original P3..11 arrive
      DupAck w/ SACK for original P12 arrives
      [cwnd-pipe = 10-9 >=1]
      send P21
      DupAck w/SACK for P13 arrives
      send P22
      ...
      DupAck w/SACK for P20 arrives
      send P29
      [FlightSize=29]

      (Ack for rexmit of P1 would arrive here unless it got dropped)

      DupAck w/SACK for P21 arrives
      [loss of rexmit P1 detected after one RTT from rexmit of P1]

      SET cwnd = ssthresh = FlightSize/2 = 29/2 = 14.5

      CWND INCREASES when it should be at most 5 after halving it
      twice!!!
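
The arithmetic in the example above can be reproduced with a short sketch. It only illustrates the numbers quoted in this message and is not an implementation of RFC 6675 or RACK-TLP. The function name `second_loss_cwnd` and its parameters are hypothetical, and all quantities are counted in whole segments.

```python
def second_loss_cwnd(initial_cwnd=20, new_segments_sent=9):
    """Replay the P1..P29 example; every value is in segments.

    Hypothetical helper for illustration only. It assumes, as in the
    example, that no cumulative ACK arrives during recovery because P1
    (and its retransmission) are lost.
    """
    # cwnd=20; P1..P20 are in flight when the loss of P1 is detected.
    flight_size = initial_cwnd

    # First reduction: cwnd = ssthresh = FlightSize / 2 = 10.
    cwnd_after_first_loss = flight_size / 2

    # DupAcks for P12..P20 each release one new segment (P21..P29).
    # With no cumulative ACK, FlightSize only grows: 20 + 9 = 29.
    flight_size += new_segments_sent

    # Loss of the retransmitted P1 is detected. Reusing
    # cwnd = ssthresh = FlightSize / 2 now yields 14.5.
    cwnd_from_flight_size = flight_size / 2

    # Halving the original window twice would give at most 5.
    cwnd_halved_twice = initial_cwnd / 2 / 2

    return {
        "flight_size": flight_size,
        "cwnd_after_first_loss": cwnd_after_first_loss,
        "cwnd_from_flight_size": cwnd_from_flight_size,
        "cwnd_halved_twice": cwnd_halved_twice,
    }


if __name__ == "__main__":
    for name, value in second_loss_cwnd().items():
        print(f"{name} = {value}")
```

Under these assumptions the FlightSize-based value (14.5) exceeds the value set after the first loss (10), which is the increase the example points out.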

      > We can continue to go around on our interpretation of these
      documents,
      > but fundamentally if there is ambiguity in 5681/6675 we should
      bis
      > those RFCs rather than expand the scope of RACK.

      As I said earlier, I am not opposing bis, though 5681bis would not
      be needed, I think.

      But let me repeat: if we publish RACK-TLP now without necessary warnings
      or without a correct congestion control algorithm someone will try to
      implement RACK-TLP with RFC 6675 and it will be a total mess. The
      behavior will be unpredictable and quite likely unsafe congestion
      control behavior.

      > 2.
      > [Markku]
      > " In short:
      > When with a non-RACK-TLP implementation timer (RTO) expires:
      cwnd=1
      > MSS,
      > and slow start is entered.
      > When with a RACK_TLP implementation timer (PTO) expires,
      > normal fast recovery is entered (unless implementing
      > also PRR). So no RTO recovery as explicitly stated in Sec.
      7.4.1."
      >
      > [Martin]
      > There may be a misunderstanding here. PTO is not the same as
      RTO, and
      > both mechanisms exist! The loss response to a PTO is to send a
      probe;
      > the RTO response is as with conventional TCP. In Section 7.3:

      No, I don't think I misunderstood. If you call timeout with
      another name, it is still timeout. And congestion control does
      not
      consider which segments to send (SND.UNA vs. probe w/ higher
      sequence
      number), only how much is sent.

      You ignored my major point where I decoupled congestion control
      from loss
      detection and loss recovery and compared RFC 5681 behavior to
      RACK-TLP
      behavior in exactly the same scenario where an entire flight is
      lost and
      timer expires.

      Please comment why congestion control behavior is allowed to be
      radically
      different in these two implementations?

      RFC 5681 & RFC 6298 timeout:

              RTO=SRTT+4*RTTVAR (RTO used for arming the timer)
             1. RTO timer expires
             2. cwnd=1 MSS; ssthresh=FlightSize/2; rexmit one segment
             3. Ack of rexmit sent in step 2 arrives
             4. cwnd = cwnd+1 MSS; send two segments
             ...

      RACK-TLP timeout:

              PTO=min(2*SRTT,RTO) (PTO used for arming the timer)
             1. PTO timer expires
             2. (cwnd=1 MSS); (re)xmit one segment
             3. Ack of (re)xmit sent in step 2 arrives
             4. cwnd = ssthresh = FlightSize/2; send N=cwnd segments

      If FlightSize is 100 segments when timer expires, congestion
      control is
      the same in steps 1-3, but in step 4 the standard congestion
      control
      allows transmitting 2 segments, while RACK-TLP would allow
      blasting 50 segments.
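
The two timelines above can be compared with a minimal sketch. It uses the simplified formulas exactly as written in this message (RFC 6298 additionally applies a clock-granularity term and a minimum RTO, which are omitted here). The function names are hypothetical, and window sizes are counted in whole segments.

```python
def rto(srtt, rttvar):
    # Simplified RTO as written in the message: SRTT + 4*RTTVAR.
    return srtt + 4 * rttvar


def pto(srtt, rttvar):
    # Simplified PTO as written in the message: min(2*SRTT, RTO).
    return min(2 * srtt, rto(srtt, rttvar))


def segments_allowed_at_step_4(flight_size, timer):
    """Segments the sender may transmit once the ACK of the single
    (re)transmission arrives (step 4 in both timelines).

    Hypothetical helper; `timer` is "rto" or "pto".
    """
    if timer == "rto":
        cwnd = 1       # step 2: cwnd = 1 MSS, slow start is entered
        cwnd += 1      # step 4: cwnd = cwnd + 1 MSS
        return cwnd
    if timer == "pto":
        # step 4: cwnd = ssthresh = FlightSize / 2 (fast recovery)
        return flight_size // 2
    raise ValueError("timer must be 'rto' or 'pto'")


if __name__ == "__main__":
    # Timer values in milliseconds, chosen only for illustration.
    print("RTO:", rto(100, 10), "PTO:", pto(100, 10))
    print("RTO path:", segments_allowed_at_step_4(100, "rto"))
    print("PTO path:", segments_allowed_at_step_4(100, "pto"))
```

With FlightSize of 100 segments the sketch gives 2 segments for the RTO path and 50 for the PTO path, matching the figures in the text.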

      > After attempting to send a loss probe, regardless of whether a
      loss
      >    probe was sent, the sender MUST re-arm the RTO timer, not
      the PTO
      >    timer, if FlightSize is not zero.  This ensures RTO
      recovery remains
      >    the last resort if TLP fails.
      > "

      This does not prevent the above RACK-TLP behavior from getting
      realized.

      > So a pure RTO response exists in the case of persistent
      congestion that
      > causes losses of probes or their ACKs.

      Yes, RTO response exists BUT only after RACK-TLP at least once
      blasts the
      network. It may well be that with smaller windows RACK-TLP is
      successful
      during its TLP initiated overly aggressive "fast recovery" and
      never
      enters RTO recovery because it may detect and repair also loss
      of
      rexmits. That is, it continues at too high rate even if lost
      rexmits
      indicate that congestion persists in successive windows of data.
      And
      worse, it is successful because it pushes away other compatible
      TCP
      flows by being too aggressive and unfair.

      Even a single shot burst every time there is significant loss
      event is not acceptable, not to mention continuous
      aggressiveness, and
      this is exactly what RFC 2914 and RFC 5033 explicitly address
      and warn
      about.

      Are we ignoring these BCPs that have IETF consensus?

      And the other important question I'd like to have answered:

      What is the justification to modify standard TCP congestion
      control to
      use fast recovery instead of slow start for a case where timeout
      is
      needed to detect the packet losses because there is no feedback
      and ack
      clock is lost? RACK-TLP explicitly instructs to do so in Sec.
      7.4.1.

      As I noted: based on what is written in the draft it does not
      intend to
      change congestion control but effectively it does.

      /Markku

      > Martin
      >
      >
      > On Wed, Dec 16, 2020 at 11:39 AM Markku Kojo
      <kojo@xxxxxxxxxxxxxx>
      > wrote:
      >       Hi Martin,
      >
      >       On Tue, 15 Dec 2020, Martin Duke wrote:
      >
      >       > Hi Markku,
      >       >
      >       > Thanks for the comments. The authors will incorporate
      >       many of your
      >       > suggestions after the IESG review.
      >       >
      >       > There's one thing I don't understand in your comments:
      >       >
      >       > " That is,
      >       > where can an implementer find advice for correct
      >       congestion control
      >       > actions with RACK-TLP, when:
      >       >
      >       > (1) a loss of rexmitted segment is detected
      >       > (2) an entire flight of data gets dropped (and
      detected),
      >       >      that is, when there is no feedback available and
      a
      >       timeout
      >       >      is needed to detect the loss "
      >       >
      >       > Section 9.3 is the discussion about CC, and is clear
      that
      >       the
      >       > implementer should use either 5681 or 6937.
      >
      >       Just a cite nit: RFC 5681 provides basic CC concepts and
      >       some useful CC
      >       guidelines but given that RACK-TLP MUST implement SACK
      the
      >       algorithm in
      >       RFC 5681 is not that useful and an implementer quite
      likely
      >       follows
      >       mainly the algorithm in RFC 6675 (and not RFC 6937 at
      all
      >       if not
      >       implementing PRR).
      >       And RFC 6675 is not mentioned in Sec 9.3, though it is
      >       listed in the
      >       Sec. 4 (Requirements).
      >
      >       > You went through the 6937 case in detail.
      >
      >       Yes, but without correct CC actions.
      >
      >       > If 5681, it's pretty clear to me that in (1) this is a
      >       new loss
      >       > detection after acknowledgment of new data, and
      therefore
      >       requires a
      >       > second halving of cwnd.
      >
      >       Hmm, not sure what you mean by "this is a new loss
      >       detection after
      >       acknowledgment of new data"?
      >       But anyway, RFC 5681 gives the general principle to
      reduce
      >       cwnd and
      >       ssthresh twice if a retransmission is lost but IMHO (and
      I
      >       believe many
      >       who have designed new loss recovery and CC algorithms or
      >       implemented them
      >       agree) that it is hard to get things right if only
      >       congestion control
      >       principles are available and no algorithm.
      >       That's why ALL mechanisms that we have include a quite
      >       detailed algorithm
      >       with all necessary variables and actions for loss
      recovery
      >       and/or CC
      >       purposes (and often also pseudocode). Like this document
      >       does for loss
      >       detection.
      >
      >       So the problem is that we do not have a detailed enough
      >       algorithm or
      >       rule that tells exactly what to do when a loss of rexmit
      is
      >       detected.
      >       Even worse, the algorithms in RFC 5681 and RFC 6675
      refer
      >       to
      >       equation (4) of RFC 5681 to reduce ssthresh and cwnd
      when a
      >       loss
      >       requiring a congestion control action is detected:
      >
      >         (cwnd =) ssthresh = FlightSize / 2
      >
      >       And RFC 5681 gives a warning not to halve cwnd in the
      >       equation but
      >       FlightSize.
      >
      >       That is, this equation is what an implementer
      intuitively
      >       would use
      >       when reading the relevant RFCs but it gives a wrong
      result
      >       for
      >       outstanding data when in fast recovery (when the sender
      is
      >       in
      >       congestion avoidance and the equation (4) is used to
      halve
      >       cwnd, it
      >       gives a correct result).
      >       More precisely, during fast recovery FlightSize is
      inflated
      >       when new
      >       data is sent and reduced when segments are cumulatively
      >       Acked.
      >       What the outcome is depends on the loss pattern. In the
      >       worst case,
      >       FlightSize is significantly larger than in the beginning
      of
      >       the fast
      >       recovery when FlightSize was (correctly) used to
      determine
      >       the halved
      >       value for cwnd and ssthresh, i.e., equation (4) may
      result
      >       in
      >       *increasing* cwnd upon detecting a loss of a rexmitted
      >       segment, instead
      >       of further halving it.
      >
      >       A clever implementer might have no problem to have it
      right
      >       with some
      >       thinking but I am afraid that there will be incorrect
      >       implementations
      >       with what is currently specified. Not all implementers
      have
      >       spent
      >       significant fraction of their career in solving TCP
      >       peculiarities.
      >
      >       > For (2), the RTO timer is still operative so
      >       > the RTO recovery rules would still follow.
      >
      >       In short:
      >       When with a non-RACK-TLP implementation timer (RTO)
      >       expires: cwnd=1 MSS,
      >       and slow start is entered.
      >       When with a RACK_TLP implementation timer (PTO) expires,
      >       normal fast recovery is entered (unless implementing
      >       also PRR). So no RTO recovery as explicitly stated in
      Sec.
      >       7.4.1.
      >
      >       This means that this document explicitly modifies
      standard
      >       TCP congestion
      >       control when there are no acks coming and the
      >       retransmission timer
      >       expires
      >
      >       from: RTO=SRTT+4*RTTVAR (RTO used for arming the timer)
      >              1. RTO timer expires
      >              2. cwnd=1 MSS; ssthresh=FlightSize/2; rexmit one
      >       segment
      >              3. Ack of rexmit sent in step 2 arrives
      >              4. cwnd = cwnd+1 MSS; send two segments
      >              ...
      >
      >       to:   PTO=min(2*SRTT,RTO) (PTO used for arming the timer)
      >              1. PTO timer expires
      >              2. (cwnd=1 MSS); (re)xmit one segment
      >              3. Ack of (re)xmit sent in step 2 arrives
      >              4. cwnd = ssthresh = FlightSize/2; send N=cwnd
      >       segments
      >
      >       For example, if FlightSize is 100 segments when timer
      >       expires,
      >       congestion control is the same in steps 1-3, but in step
      4
      >       the
      >       current standard congestion control allows transmitting
      2
      >       segments,
      >       while RACK-TLP would allow blasting 50 segments.
      >
      >       Question is: what is the justification to modify
      standard
      >       TCP
      >       congestion control to use fast recovery instead of slow
      >       start for a
      >       case where timeout is needed to detect loss because
      there
      >       is no
      >       feedback and ack clock is lost? The draft does not give
      any
      >       justification. This clearly is in conflict with items
      (0)
      >       and (1)
      >       in BCP 133 (RFC 5033).
      >
      >       Furthermore, there is no implementation nor experimental
      >       experience
      >       evaluating this change. The implementation with
      >       experimental experience
      >       uses PRR (RFC 6937) which is an Experimental
      specification
      >       including a
      >       novel "trick" that directs PRR fast recovery to
      effectively
      >       use slow
      >       start in this case at hand.
      >
      >
      >       > In other words, I am not seeing a case that requires
      new
      >       congestion
      >       > control concepts except as discussed in 9.3.
      >
      >       See above. The change in standard congestion control for
      >       (2).
      >       The draft intends not to change congestion control but
      >       effectively it
      >       does without any operational evidence.
      >
      >       What's also missing and would be very useful:
      >
      >       - For (1), a hint for an implementer saying that because
      >       RACK-TLP is
      >          able to detect a loss of a rexmit unlike any other
      loss
      >       detection
      >          algorithm, the sender MUST react twice to congestion
      >       (and cite
      >          RFC 5681). And cite a document where necessary
      correct
      >       actions
      >          are described.
      >
      >       - For (1), advise that an implementer needs to keep
      track
      >       when it
      >          detects a loss of a retransmitted segment. Current
      >       algorithms
      >          in the draft detect a loss of retransmitted segment
      >       exactly in
      >          the same way as loss of any other segment. There
      seems
      >       to be
      >          nothing to track when a retransmission of a
      >       retransmitted segment
      >          takes place. Therefore, the algorithms should have
      >       additional
      >          actions to correctly track when such a loss is
      detected.
      >
      >       - For (1), discussion on how many times a loss of a
      >       retransmission
      >          of the same segment may occur and be detected. Seems
      >       that it
      >          may be possible to drop a rexmitted segment more than
      >       once and
      >          detect it also several times?  What are the
      >       implications?
      >
      >       - If previous is possible, then the algorithm possibly
      also
      >          may detect a loss of a new segment that was sent
      during
      >       fast
      >          recovery? This is also loss in two successive windows
      of
      >       data,
      >          and cwnd MUST be lowered twice. This discussion and
      >       necessary
      >          actions to track it are missing, if such scenario is
      >       possible.
      >
      >       > What am I missing?
      >
      >       Hope the above helps.
      >
      >       /Markku
      >
      >
      > <snipping the rest>
      >
      >
