Re: [Last-Call] [tcpm] Genart last call review of draft-ietf-tcpm-rto-consider-14

Stewart Bryant <stewart.bryant@xxxxxxxxx> · Mon, 8 Jun 2020 13:57:37 +0100

On 6 Jun 2020, at 08:19, Gorry Fairhurst <gorry@xxxxxxxxxxxxxx> wrote:

  Please see below.

    On 05/06/2020 17:43, Mark Allman wrote:

      Hi Stewart!

Thanks for the feedback.  Sorry for the long RTT.  I had a recent
deadline and am now trying to dig out.

        Major issues:

As far as I can see this text only applies to exchanges between
applications and network support applications such as
DNS. I.e. this is targeted at layer 4 and above. Given the
religious nature of BCPs in the eyes of some reviewers, and to
prevent endless explanations by those that design routing
protocols, OAM and other lower layer sub-systems, I think there
needs to be scoping text in block capitals at the very start
of the document.

      I am not entirely sure what you're suggesting here.  Per my note to
Tom, I am going to add a few words to the intro.  Maybe that will
help.  I think it's unlikely I'll use block capitals! :-)

        =========

      - The requirements in this document may not be appropriate in all
        cases and, therefore, inconsistent deviations may be necessary
        (hence the "SHOULD" in the last bullet).  However,
        inconsistencies MUST be (a) explained and (b) gather consensus.

SB> That can be quite an onerous obligation and provide scope for
SB> endless argument when reviewers are not domain experts in the
SB> protocol being designed.

      This was added because another reviewer thought it was for sure
necessary.

I guess I don't understand why you'd call this 'an onerous
obligation' since presumably you'd do it anyway without this
document.  Are we ramming things through without consensus?  If not
(my assumption), (b) is no sweat.  Are we ramming things through
without thought?  If not (my assumption), (a) is straightforward and
hopefully is being done anyway.  In other words, I don't understand
the complaint here because if you don't want to use the guidelines
then that is fine, but in going through the standard process to
define a loss detector you'll end up meeting this bullet.  Even if
this document doesn't get published or didn't exist our documents
should still be meeting this bullet.

        =======

          While there are a bevy of uses for timers in protocols---from
          rate-based pacing to connection failure detection and
          beyond---these are outside the scope of this document.

SB> I am not sure what that means for the applicability of this
SB> document.

      This was added at some point along the way because someone thought
something like rate-based pacing could be covered by the guidelines
and the intent is to say it is not.  I have zero love for this bit
and would happily remove it, but am loath to do so because the old
comment will then come back.

    I think Mark is correct, there are many transport uses of timers,
    and calling out a small number of other uses was important to scope
    this within the transport discussions, even if it just says "timers
    also do other stuff".

If the scope of this is explicitly transport and above I have no issues.

If it has a greater scope, the scope of the study and recommendations really needs to increase accordingly.

        =========

    (1) As we note above, loss detection happens when a sender does not
        receive delivery confirmation within an expected period of
        time.  In the absence of any knowledge about the latency of a
        path, the initial RTO MUST be conservatively set to no less than
        1 second.

SB> This issue may be addressed by the scoping text, but 1s is no
SB> use when you are trying to detect sub 50ms of packet loss in
SB> the infrastructure.

      We have to start somewhere when we know nothing.

I think in my thread with Tom we hit upon this notion that the
document is really about sort of arbitrary, unknown and therefore
presumed unreliable networks.  I am going to add some words to this
effect.  Does this help?

Again, for specific environments where things are more nailed down
and known, deviations are fine and explicitly OK.  But, as a general
default I think saying "when you don't know anything < 50msec is
cool" is unlikely to be appropriate.  Well, no, I think it would be
quite inappropriate, actually.

    This is, I think, a natural discussion based on a different
    perspective. The 1 second initial starting value for a transport
    path has been there for a long time, and transport reviewers will
    frequently quote this, be it for transports (SCTP, TCP) or for
    UDP-based apps (BCP 145, Sect 3.1.1). I'd expect this is about the
    assumed starting position for an Internet path.

    True, if we're talking about a link between adjacent peers, this
    is something very different.

We do multi-hop OAM in RTG to hold the infrastructure together.

Again, my point is that if the scope is L4 and above I have no issue, but the scope seems to be wider.

        =============

    (3) Each time the RTO is used to detect a loss, the value of the RTO
        MUST be exponentially backed off such that the next firing
        requires a longer interval.  The backoff SHOULD be removed after
        either (a) the subsequent successful transmission of
        non-retransmitted data, or (b) an RTO passes without detecting
        additional losses.  The former will generally be quicker.  The
        latter covers cases where loss is detected, but not repaired.

        A maximum value MAY be placed on the RTO.  The maximum RTO MUST
        NOT be less than 60 seconds (as specified in [RFC6298]).

        This ensures network safety.

SB> This does not work in OAM applications.

      Well, OK, get consensus to do something different---which is
completely fine.  I think retransmission timers have shown
themselves to be crucial for preventing collapse and, again, as a
default I think this is our best advice.

    It should be applicable for OAM applications that use a path across
    the Internet that can change, and certainly could be bad advice for
    a controlled environment. It's actually not new; BCP 145 also speaks
    of backoff.
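
To make requirement (3) quoted above concrete, here is a minimal Python sketch of the backoff it describes. It is only an illustration: the class and method names, the doubling factor, and the 1-second base are assumptions chosen for the example, and only reset path (a) is modelled. The draft itself only requires that the backoff be exponential and that any maximum RTO be no less than 60 seconds.

```python
class RtoBackoff:
    """Hypothetical helper illustrating requirement (3); names are not from the draft."""

    MIN_MAX_RTO = 60.0  # a maximum RTO, if used, must not be below 60 seconds

    def __init__(self, base_rto=1.0, max_rto=None, factor=2.0):
        if max_rto is not None and max_rto < self.MIN_MAX_RTO:
            raise ValueError("maximum RTO must not be less than 60 seconds")
        self.base_rto = base_rto  # RTO before any backoff is applied
        self.max_rto = max_rto    # optional cap (None means no cap)
        self.factor = factor      # assumed multiplier; doubling is an example choice
        self.timeouts = 0         # consecutive RTO firings since the last reset

    def current(self):
        """Return the RTO to arm next, including any backoff and cap."""
        rto = self.base_rto * self.factor ** self.timeouts
        if self.max_rto is not None:
            rto = min(rto, self.max_rto)
        return rto

    def on_timeout(self):
        """The RTO fired: back off so the next firing takes longer."""
        self.timeouts += 1
        return self.current()

    def on_new_data_acked(self):
        """Case (a): non-retransmitted data was delivered, so remove the backoff."""
        self.timeouts = 0
```

With a 1-second base and a 60-second cap, successive timeouts under this sketch give 2, 4, 8, 16, 32 and then 60 seconds.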

A common rule in OAM-type situations is three fast packets and then back-off.
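
The "three fast packets and then back-off" rule mentioned here could look roughly like the sketch below. The function name, its parameters, and the doubling factor are hypothetical; no particular OAM protocol is being described, and real protocols define their own intervals.

```python
from itertools import islice


def probe_intervals(fast_interval, fast_count=3, factor=2.0, max_interval=None):
    """Yield send intervals: `fast_count` fast ones, then exponentially longer ones.

    All parameters are illustrative assumptions, not values from any specification.
    """
    interval = fast_interval
    sent = 0
    while True:
        yield interval
        sent += 1
        if sent >= fast_count:
            # The fast phase is over: lengthen every subsequent interval.
            interval *= factor
            if max_interval is not None:
                interval = min(interval, max_interval)


# Example: the first six intervals for a 1-second fast interval.
first_six = list(islice(probe_intervals(1.0), 6))
```

Here the motivation for backing off may differ from the transport case (for instance, protecting the peer's control processor rather than avoiding congestion), but the shape of the schedule is similar.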

        Minor issues:

 "By waiting long enough that we are unambiguously
  certain a packet has been lost we cannot repair losses in a timely
  manner and we risk prolonging network congestion."

I have a concern here that the emphasis is on classical
operation. We are beginning to see applications run over the
network where the timely delivery of a packet is critical for
correct operation of even SoL. As a BCP the text needs to
recognise that the scope and purpose of IP is changing and that
classical learning and rules derived from them may not apply.

Also if not ruled out of scope earlier we need to be clear at this
point that things like BFD have different considerations.

    Isn't BFD a link protocol between adjacent systems?

No, not always, you can have multiple-hop BFD.

This is infrastructure and not user data, and there is a school of thought that in data planes where
the same data path is used for both control and user data, the user data is sacrificial to maintaining
the infrastructure. The reason that you do backoff in these cases is not to avoid congestion but
instead to avoid overloading the control peer, i.e. the route processor in the peer router.

- Stewart

      I am going to suggest we revisit this after I hack out a little
extra text for the intro.  You can see if that helps.

        ==========

      "- This document does not update or obsolete any existing RFC.
        These previous specifications---while generally consistent with
        the requirements in this document---reflect community consensus
        and this document does not change that consensus."

I think it needs to be clear that adherence to this RFC is not
required for minor updates and extensions to existing RFCs. Having
seen minor routing extensions held up by security concerns related
to underlying protocols rather than the extension itself there is
a lot of sensitivity on this point in some quarters of the IETF.

      Um.  Do you have suggested words?  I am not much of a protocol
lawyer (thankfully!), but I am not really conjuring the case you're
concerned about.  Something like ...

  (1) RFC XXXX was published 10 years ago and violates
      rto-consider.
  (2) We want to do a XXXXbis.
  (3) The bis has to then explain why it's cool to violate
      rto-consider.

..... ?

I would say if XXXX has a loss detector that had consensus and has
been in use for a while it'd be pretty easy to get consensus for
XXXXbis that we can still use it as it has worked fine.

        It might be useful to make it clear that there are some
applications that would prefer no data to late data.

      This document is about loss detection, not what one does after
detecting.  So, we do say ...

    However, as discussed above, the detected loss need not be
    repaired

I am happy to reinforce this point.  Text suggestions welcome.

        Nits/editorial comments:

The terminology section confuses ID-nits - I think it should be a
section in its own right later in the document.

      Yeah, id-nits as it is run when submitting doesn't flag this.  It
was flagged by someone else in LC.  Because I am old school it's
hard to renumber everything and so I was just leaving this for the
rfc-ed to do something reasonable here.

        The following nits issues need looking at

  == Missing Reference: 'RFC5681' is mentioned on line 377, but not defined

  == Unused Reference: 'RFC3940' is defined on line 515, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC4340' is defined on line 519, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC6582' is defined on line 540, but no explicit
     reference was found in the text

      I will fix all these.  Again, I was trusting the id-nits when I
submitted and these were not flagged (or, if they were it wasn't in
a way that foisted them on my screen).  But, they're easy fixes, so
thanks!

allman
