Hi Martin,
See inline.
On Wed, 16 Dec 2020, Martin Duke wrote:
Hi Markku,
There is a ton here, but I'll try to address the top points. Hopefully
they obviate the rest.
Sorry for being verbose. I tried to be clear but you actually removed my
key issues/questions ;)
1.
[Markku]
"Hmm, not sure what you mean by "this is a new loss detection after
acknowledgment of new data"?
But anyway, RFC 5681 gives the general principle to reduce cwnd and
ssthresh twice if a retransmission is lost but IMHO (and I believe many
who have designed new loss recovery and CC algorithms or implemented
them
agree) that it is hard to get things right if only congestion control
principles are available and no algorithm."
[Martin]
So 6675 Sec 5 is quite explicit that there is only one cwnd reduction
per fast recovery episode, which ends once new data has been
acknowledged.
To be more precise: fast recovery ends when the current window becomes
cumulatively acknowledged, that is, when (4.1) RecoveryPoint (= HighData
at the beginning) becomes acknowledged. I believe we agree and that this
is what you meant, although new data below RecoveryPoint may become
cumulatively acknowledged already earlier during the fast recovery. Reno
loss recovery in RFC 5681 ends when (any) new data has been
acknowledged.
[Martin]
By definition, if a retransmission is lost it is because newer data has
been acknowledged, so it's a new recovery episode.
[Markku]
I am not sure where this definition comes from. Newer than what are you
referring to?
But, yes, if a retransmission is lost with the RFC 6675 algorithm, an
RTO is required to detect it, and it definitely starts a new recovery
episode. That is, a new recovery episode is enforced by step (1.a) of
NextSeg (), which prevents retransmission of a segment that has already
been retransmitted. If RACK-TLP is used for detecting loss with RFC
6675, things get different in many ways, because it may detect the loss
of a retransmission. It would pretty much require an entire redesign of
the algorithm. For example, the calculation of pipe does not consider
segments that have been retransmitted more than once.
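A rough sketch of the rule I am referring to, in C (the scoreboard
structure and names are mine; only rule (1.a) is shown, the rest of
NextSeg () is elided):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef uint32_t seq_t;  /* sequence wraparound ignored for brevity */

  /* Minimal stand-in for the RFC 6675 scoreboard; only the field needed
   * for rule (1.a) is shown. */
  struct scoreboard {
      seq_t high_rxt;  /* HighRxt: highest sequence number retransmitted */
  };

  /* Rule (1.a) of NextSeg(): a candidate S2 is eligible for (fast)
   * retransmission only if it lies above HighRxt, i.e. it has not been
   * retransmitted before.  A lost retransmission therefore cannot be
   * repaired by this rule and has to wait for the RTO. */
  static bool nextseg_rule_1a(const struct scoreboard *sb, seq_t s2)
  {
      return s2 > sb->high_rxt;
  }

  int main(void)
  {
      struct scoreboard sb = { .high_rxt = 1000 };
      printf("S2=1000 eligible: %d\n", nextseg_rule_1a(&sb, 1000)); /* 0 */
      printf("S2=2000 eligible: %d\n", nextseg_rule_1a(&sb, 2000)); /* 1 */
      return 0;
  }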
[Martin]
Meanwhile, during the Fast Recovery period the incoming acks implicitly
remove data from the network and therefore keep flightsize low.
[Markku]
Incorrect. FlightSize != pipe. Only cumulative acks remove data from
FlightSize, and new data transmitted during fast recovery inflates
FlightSize. How FlightSize evolves depends on the loss pattern, as I
said. It is also possible that FlightSize is low; it may err in both
directions. A simple example serves as a proof of the case where cwnd
increases if a loss of a retransmission is detected and repaired:
RFC 6675 recovery with RACK-TLP loss detection:
(contains some inaccuracies because it has not been defined how
lost rexmits are calculated into pipe)
cwnd=20; packets P1,...,P20 in flight = current window of data
[P1 dropped and rexmit of P1 will also be dropped]
DupAck w/SACK for P2 arrives
[loss of P1 detected after one RTT from original xmit of P1]
[cwnd=ssthresh=10]
P1 is rexmitted (and it logically starts next window of data)
DupAcks w/ SACK for original P3..11 arrive
DupAck w/ SACK for original P12 arrives
[cwnd-pipe = 10-9 >=1]
send P21
DupAck w/SACK for P13 arrives
send P22
...
DupAck w/SACK for P20 arrives
send P29
[FlightSize=29]
(Ack for rexmit of P1 would arrive here unless it got dropped)
DupAck w/SACK for P21 arrives
[loss of rexmit P1 detected after one RTT from rexmit of P1]
SET cwnd = ssthresh = FlightSize/2 = 29/2 = 14.5
CWND INCREASES when it should be at most 5 after halving it twice!!!
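Spelling out the arithmetic of the trace (numbers only, no protocol
logic; this just reproduces the figures above):

  #include <stdio.h>

  int main(void)
  {
      /* P1..P20 in flight; first loss (original P1) detected */
      double ssthresh1 = 20.0 / 2.0;            /* cwnd = ssthresh = 10 */

      /* New data P21..P29 sent during recovery inflates FlightSize to 29;
       * then the loss of the retransmission of P1 is detected. */
      double flight_at_2nd_loss = 29.0;
      double eq4_result  = flight_at_2nd_loss / 2.0; /* 14.5              */
      double correct_max = ssthresh1 / 2.0;          /* 5, second halving */

      printf("after first reduction:    %.1f\n", ssthresh1);
      printf("eq.(4) at loss of rexmit: %.1f  <- cwnd *increases*\n",
             eq4_result);
      printf("expected upper bound:     %.1f\n", correct_max);
      return 0;
  }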
[Martin]
We can continue to go around on our interpretation of these documents,
but fundamentally if there is ambiguity in 5681/6675 we should bis
those RFCs rather than expand the scope of RACK.
[Markku]
As I said earlier, I am not opposing a bis, though a 5681bis would not
be needed, I think.
But let me repeat: if we publish RACK-TLP now without the necessary
warnings or a correct congestion control algorithm, someone will try to
implement RACK-TLP with RFC 6675 and it will be a total mess. The
behavior will be unpredictable, and quite likely the congestion control
behavior will be unsafe.
2.
[Markku]
" In short:
When with a non-RACK-TLP implementation the timer (RTO) expires: cwnd=1
MSS, and slow start is entered.
When with a RACK-TLP implementation the timer (PTO) expires, normal fast
recovery is entered (unless implementing also PRR). So no RTO recovery,
as explicitly stated in Sec. 7.4.1."
[Martin]
There may be a misunderstanding here. PTO is not the same as RTO, and
both mechanisms exist! The loss response to a PTO is to send a probe;
the RTO response is as with conventional TCP. In Section 7.3:
[Markku]
No, I don't think I misunderstood. If you call a timeout by another
name, it is still a timeout. And congestion control does not consider
which segments to send (SND.UNA vs. a probe with a higher sequence
number), only how much is sent.
You ignored my major point, where I decoupled congestion control from
loss detection and loss recovery and compared RFC 5681 behavior to
RACK-TLP behavior in exactly the same scenario where an entire flight is
lost and the timer expires.
Please comment on why the congestion control behavior is allowed to be
radically different in these two implementations.
RFC 5681 & RFC 6298 timeout:
RTO=SRTT+4*RTTVAR (RTO used for arming the timer)
1. RTO timer expires
2. cwnd=1 MSS; ssthresh=FlightSize/2; rexmit one segment
3. Ack of rexmit sent in step 2 arrives
4. cwnd = cwnd+1 MSS; send two segments
...
RACK-TLP timeout:
PTO=min(2*SRTT,RTO) (PTO used for arming the timer)
1. PTO timer expires
2. (cwnd=1 MSS); (re)xmit one segment
3. Ack of (re)xmit sent in step 2 arrives
4. cwnd = ssthresh = FlightSize/2; send N=cwnd segments
If FlightSize is 100 segments when the timer expires, congestion control
is the same in steps 1-3, but in step 4 the standard congestion control
allows transmitting 2 segments, while RACK-TLP would allow blasting 50
segments.
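The arithmetic behind that, for FlightSize = 100 segments (a plain
illustration of step 4, nothing more; the variable names are mine):

  #include <stdio.h>

  int main(void)
  {
      unsigned flight = 100;  /* segments outstanding when timer fires */

      /* RFC 5681 / RFC 6298: RTO -> cwnd = 1 MSS, slow start.  After the
       * ACK of the retransmission cwnd grows to 2, so 2 segments go out. */
      unsigned after_rto_ack = 2;

      /* RACK-TLP as read here (Sec. 7.4.1): PTO -> fast recovery, so on
       * the same ACK cwnd = ssthresh = FlightSize / 2. */
      unsigned after_pto_ack = flight / 2;

      printf("step 4, slow start:        %u segments\n", after_rto_ack);
      printf("step 4, RACK-TLP recovery: %u segments\n", after_pto_ack);
      return 0;
  }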
[Martin, quoting Section 7.3 of the draft]
"After attempting to send a loss probe, regardless of whether a loss
probe was sent, the sender MUST re-arm the RTO timer, not the PTO
timer, if FlightSize is not zero. This ensures RTO recovery remains
the last resort if TLP fails."
[Markku]
This does not prevent the above RACK-TLP behavior from getting realized.
[Martin]
So a pure RTO response exists in the case of persistent congestion that
causes losses of probes or their ACKs.
[Markku]
Yes, an RTO response exists, BUT only after RACK-TLP has blasted the
network at least once. It may well be that with smaller windows RACK-TLP
is successful during its TLP-initiated, overly aggressive "fast
recovery" and never enters RTO recovery because it may also detect and
repair losses of rexmits. That is, it continues at too high a rate even
if lost rexmits indicate that congestion persists in successive windows
of data. And worse, it is successful because it pushes away other
TCP-compatible flows by being too aggressive and unfair.
Even a single-shot burst every time there is a significant loss event is
not acceptable, not to mention continuous aggressiveness, and this is
exactly what RFC 2914 and RFC 5033 explicitly address and warn about.
Are we ignoring these BCPs that have IETF consensus?
And the other important question I'd like to have answered:
What is the justification for modifying standard TCP congestion control
to use fast recovery instead of slow start in a case where a timeout is
needed to detect the packet losses because there is no feedback and the
ack clock is lost? RACK-TLP explicitly instructs to do so in Sec. 7.4.1.
As I noted: based on what is written in the draft, it does not intend to
change congestion control, but effectively it does.
/Markku
Martin
On Wed, Dec 16, 2020 at 11:39 AM Markku Kojo <kojo@xxxxxxxxxxxxxx>
wrote:
Hi Martin,
On Tue, 15 Dec 2020, Martin Duke wrote:
> Hi Markku,
>
> Thanks for the comments. The authors will incorporate
many of your
> suggestions after the IESG review.
>
> There's one thing I don't understand in your comments:
>
> " That is,
> where can an implementer find advice for correct
congestion control
> actions with RACK-TLP, when:
>
> (1) a loss of rexmitted segment is detected
> (2) an entire flight of data gets dropped (and detected),
> that is, when there is no feedback available and a
timeout
> is needed to detect the loss "
>
> Section 9.3 is the discussion about CC, and is clear that
the
> implementer should use either 5681 or 6937.
Just a cite nit: RFC 5681 provides basic CC concepts and
some useful CC
guidelines but given that RACK-TLP MUST implement SACK the
algorithm in
RFC 5681 is not that useful and an implementer quite likely
follows
mainly the algorithm in RFC 6675 (and not RFC 6937 at all
if not
implementing PRR).
And RFC 6675 is not mentioned in Sec 9.3, though it is
listed in the
Sec. 4 (Requirements).
> You went through the 6937 case in detail.
Yes, but without correct CC actions.
> If 5681, it's pretty clear to me that in (1) this is a
new loss
> detection after acknowledgment of new data, and therefore
requires a
> second halving of cwnd.
Hmm, not sure what you mean by "this is a new loss
detection after
acknowledgment of new data"?
But anyway, RFC 5681 gives the general principle to reduce
cwnd and
ssthresh twice if a retransmission is lost but IMHO (and I
believe many
who have designed new loss recovery and CC algorithms or
implemented them
agree) that it is hard to get things right if only
congestion control
principles are available and no algorithm.
That's why ALL mechanisms that we have include a quite
detailed algorithm
with all necessary variables and actions for loss recovery
and/or CC
purposes (and often also pseudocode). Like this document
does for loss
detection.
So the problem is that we do not have a detailed enough
algorithm or
rule that tells exactly what to do when a loss of rexmit is
detected.
Even worse, the algorithms in RFC 5681 and RFC 6675 refer
to
equation (4) of RFC 5681 to reduce ssthresh and cwnd when a
loss
requiring a congestion control action is detected:
(cwnd =) ssthresh = FlightSize / 2
And RFC 5681 gives a warning not to halve cwnd in the
equation but
FlightSize.
That is, this equation is what an implementer intuitively
would use
when reading the relevant RFCs but it gives a wrong result
for
outstanding data when in fast recovery (when the sender is
in
congestion avoidance and the equation (4) is used to halve
cwnd, it
gives a correct result).
More precisely, during fast recovery FlightSize is inflated
when new
data is sent and reduced when segments are cumulatively
Acked.
What the outcome is depends on the loss pattern. In the
worst case,
FlightSize is significantly larger than in the beginning of
the fast
recovery when FlightSize was (correctly) used to determine
the halved
value for cwnd and ssthresh, i.e., equation (4) may result
in
*increasing* cwnd upon detecting a loss of a rexmitted
segment, instead
of further halving it.
A clever implementer might have no problem getting it right
with some thinking, but I am afraid that there will be
incorrect implementations with what is currently specified.
Not all implementers have spent a significant fraction of
their career solving TCP peculiarities.
> For (2), the RTO timer is still operative so
> the RTO recovery rules would still follow.
In short:
When with a non-RACK-TLP implementation timer (RTO)
expires: cwnd=1 MSS,
and slow start is entered.
When with a RACK-TLP implementation timer (PTO) expires,
normal fast recovery is entered (unless implementing
also PRR). So no RTO recovery as explicitly stated in Sec.
7.4.1.
This means that this document explicitly modifies standard
TCP congestion
control when there are no acks coming and the
retransmission timer
expires
from: RTO=SRTT+4*RTTVAR (RTO used for arming the timer)
1. RTO timer expires
2. cwnd=1 MSS; ssthresh=FlightSize/2; rexmit one
segment
3. Ack of rexmit sent in step 2 arrives
4. cwnd = cwnd+1 MSS; send two segments
...
to: PTO=min(2*SRTT,RTO) (PTO used for arming the timer)
1. PTO timer expires
2. (cwnd=1 MSS); (re)xmit one segment
3. Ack of (re)xmit sent in step 2 arrives
4. cwnd = ssthresh = FlightSize/2; send N=cwnd
segments
For example, if FlightSize is 100 segments when timer
expires,
congestion control is the same in steps 1-3, but in step 4
the
current standard congestion control allows transmitting 2
segments,
while RACK-TLP would allow blasting 50 segments.
Question is: what is the justification to modify standard
TCP
congestion control to use fast recovery instead of slow
start for a
case where timeout is needed to detect loss because there
is no
feedback and ack clock is lost? The draft does not give any
justification. This clearly is in conflict with items (0)
and (1)
in BCP 133 (RFC 5033).
Furthermore, there is no implementation nor experimental
experience
evaluating this change. The implementation with
experimental experience
uses PRR (RFC 6937) which is an Experimental specification
including a
novel "trick" that directs PRR fast recovery to effectively
use slow
start in this case at hand.
> In other words, I am not seeing a case that requires new
congestion
> control concepts except as discussed in 9.3.
See above. The change in standard congestion control for
(2).
The draft intends not to change congestion control but
effectively it
does without any operational evidence.
What is also missing and would be very useful:
- For (1), a hint for an implementer saying that because
RACK-TLP is
able to detect a loss of a rexmit unlike any other loss
detection
algorithm, the sender MUST react twice to congestion
(and cite
RFC 5681). And cite a document where necessary correct
actions
are described.
- For (1), advise that an implementer needs to keep track
when it
detects a loss of a retransmitted segment. Current
algorithms
in the draft detect a loss of retransmitted segment
exactly in
the same way as loss of any other segment. There seems
to be
nothing to track when a retransmission of a
retransmitted segment
takes place. Therefore, the algorithms should have
additional
actions to correctly track when such a loss is detected.
- For (1), discussion on how many times a loss of a
retransmission
of the same segment may occur and be detected. Seems
that it
may be possible to drop a rexmitted segment more than
once and
detect it also several times? What are the
implications?
- If previous is possible, then the algorithm possibly also
may detect a loss of a new segment that was sent during
fast
recovery? This is also loss in two successive windows of
data,
and cwnd MUST be lowered twice. This discussion and
necessary
actions to track it are missing, if such scenario is
possible.
> What am I missing?
Hope the above helps.
/Markku
<snipping the rest>