Re: [Last-Call] Tsvart telechat review of draft-ietf-sfc-oam-framework-13

"Joel M. Halpern" <jmh@xxxxxxxxxxxxxxx> · Wed, 20 May 2020 13:20:54 -0400

Frank, regarding your comment about SF performance, I thought the 
document was pretty clear that we consider that out of scope (cf. the 
discussions with the various ADs).

If you can see a place to add text, please propose text.

Thank you,
Joel

On 5/20/2020 1:10 PM, Frank Brockners (fbrockne) wrote:
Hi Nagendra,

Thanks for the detailed reply. Please see inline (...FB).

-----Original Message-----
From: Nagendra Kumar Nainar (naikumar) <naikumar@xxxxxxxxx>
Sent: Saturday, 16 May 2020 16:16
To: Frank Brockners (fbrockne) <fbrockne@xxxxxxxxx>; tsv-art@xxxxxxxx
Cc: sfc@xxxxxxxx; last-call@xxxxxxxx; draft-ietf-sfc-oam-framework.all@xxxxxxxx
Subject: Re: Tsvart telechat review of draft-ietf-sfc-oam-framework-13

Hi Frank,

Thank you for the review. Please see my responses inline.

     Reviewer: Frank Brockners
     Review result: Ready with Nits

     This document has been reviewed as part of the transport area review team's
     ongoing effort to review key IETF documents. These comments were written
     primarily for the transport area directors, but are copied to the document's
     authors and WG to allow them to address any issues raised and also to the
     IETF discussion list for information.

     When done at the time of IETF Last Call, the authors should consider this
     review as part of the last-call comments they receive. Please always CC
     tsv-art@xxxxxxxx if you reply to or forward this review.

     This document provides a reference framework for OAM for SFC.

     Comments:

     Section 3.1.1 SF availability: The text makes explicit reference to multiple
     instances of a SF. Consequently, it should be defined how availability of a SF
     is computed/determined in case multiple instances are deployed.

<Nagendra> This is already clarified in the section as below:

"For cases where
    multiple instances of an SF are used to realize a given SF for the
    purpose of load sharing, SF availability can be performed by checking
    the availability of any one of those instances, or the availability
    check may be targeted at a specific instance."

This further
     leads to the question of whether availability is always a "binary" state
     (available / not-available), or could a SF be, e.g., 99% available?

<Nagendra> The availability is measured as a binary state. I am not sure what
"99% available" means. If it means getting 99 responses for 100 probes sent, I
think it falls under the packet loss category, which in turn is performance
measurement.
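The two ways of checking a load-shared SF in the quoted draft text, and the binary outcome described in this reply, can be sketched as follows. This is a minimal illustration under assumptions not taken from the draft: probe outcomes are recorded per instance, and an instance counts as up if it answered at least one probe during the check. All names are hypothetical. Counting how many probes went unanswered would be a loss measurement, i.e., performance rather than availability.

```python
from typing import Dict, List, Optional

# Hypothetical probe model: for each SF instance, the outcomes of the
# availability probes sent to it in one check (True = probe answered).
ProbeResults = Dict[str, List[bool]]


def instance_available(outcomes: List[bool]) -> bool:
    # Assumption for illustration: an instance is "up" if it answered at
    # least one probe during the check. Real tools define their own criteria.
    return any(outcomes)


def sf_available(results: ProbeResults, target: Optional[str] = None) -> bool:
    """Binary availability of a load-shared SF.

    target=None: the SF is available if any one instance is available.
    target=<name>: only the named instance is checked.
    """
    if target is not None:
        return instance_available(results.get(target, []))
    return any(instance_available(outcomes) for outcomes in results.values())


if __name__ == "__main__":
    results = {"inst-1": [True, True], "inst-2": [False, False]}
    print(sf_available(results))            # True: one instance answers
    print(sf_available(results, "inst-2"))  # False: targeted instance is down
```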

...FB: Thanks. Though I'm still not entirely following. If availability is binary and I put the statements above together, what would be the availability of the following setup: There is an SF that is made up of 100 instances. 99 of these instances are powered down entirely. And the 1 instance that is "up" is alternating between servicing requests for 10min followed by not servicing requests for 10min. Would the SF be considered "available"?

Section 3.1.2
     SF performance: What is the impact of a "multiple instance SF deployment" on
     SF performance measurement?

<Nagendra> I think we covered this in SF availability but not here. Does the
updated text below look better?

OLD:
On the one hand, the performance of any specific SF can be quantified
    by measuring the loss and delay metrics of the traffic from SFF to
    the respective SF, while on the other hand, the performance can be
    measured by leveraging the loss and delay metrics from the respective
    SFs.  The latter requires SF involvement to perform the measurement
    while the former does not.

NEW:
On the one hand, the performance of any specific SF can be quantified
    by measuring the loss and delay metrics of the traffic from SFF to
    the respective SF, while on the other hand, the performance can be
    measured by leveraging the loss and delay metrics from the respective
    SFs.  The latter requires SF involvement to perform the measurement
    while the former does not. For cases where
    multiple instances of an SF are used to realize a given SF for the
    purpose of load sharing, SF performance can be quantified by measuring
    the metrics for any one instance of the SF or by measuring the metrics for
    a specific instance.

The section only talks about loss and delay as
     performance criteria. It would be good to state that other performance
     criteria (e.g. specific to the SF, throughput, etc.) exist.

<Nagendra> We can add the below to Section 3.1.2:

NEW:
"The metrics measured to quantify the performance of the SF component are not
limited to loss and delay. Other metrics, such as throughput, also exist, and
the choice of metrics for performance measurement is outside the scope of this
document."

Section 3.2.1 SFC
     availability: The current definition is very focused on connectivity
     verification, i.e. it tries to answer the question: "Does my SFC transport
     packets?". IMHO we should also ask the question "Does my SFC process the
     packets correctly?" - because if packets are not processed per the SFC
     definition, we might not call the SFC available.

<Nagendra> I think this is already handled by SF availability. The end-to-end SFC
availability is verified by steering the OAM packet over the ordered set of SFs
within the SFC. This is akin to daisy-chaining the availability checks of the SFs
within the SFC to determine end-to-end SFC availability. If the derived solution
verifies SF availability based not just on uptime but on the service treatment,
it also answers the question "Does my SFC process the packets correctly?". Let us
know if any further clarification is required.

While 3.2.2 states that "any
     SFC-aware network device should have the ability to make performance
     measurements" a similar statement isn't found in 3.2.1. IMHO the ability for
     availability checks is probably a prerequisite for performance measurement.

<Nagendra> The ability to perform end-to-end or partial SFC availability
verification is already mentioned in section 3.2.1 as below:

" In order to perform service connectivity verification of an SFC/SFP,
    the OAM functions could be initiated from any SFC-aware network
    devices of an SFC-enabled domain for end-to-end paths, or partial
    paths terminating on a specific SF, within the SFC/SFP"

Please let us know if you have any suggestions for improving the text where
clarity is lacking.

     Section 3.2.2 SFC performance measurement: The section only mentions the
     need for performance measurement. It misses the definition of what SFC
     performance measurement is.

<Nagendra>

...FB: Thanks for the suggested updates, which would definitely improve the text. One problem with SFC performance remains, though, IMHO.
All the text so far is focused on the connectivity within a SFC, not on the service itself. I.e., if you consider a "laundry service", we focus a lot on how long it takes to get the clothes shipped to and from the washing machine, but we don't focus on how well the washing machine washes the clothes.
IMHO we should either expand on the performance of the SFC and SF w.r.t. the service (especially given that you define a service layer in section 2) or clearly state that the framework focuses only on connectivity between SFs.

Section 3.3. Classifier component: The section mentions the
     need for the ability to perform performance measurement of the classifier
     component. What is performance measurement of the classifier? What does
     performance measurement of the classifier component comprise?

<Nagendra>We can add the below text:

OLD:
Any SFC-aware network device should have the ability to perform
    performance measurement of the classifier component for each SFC.

NEW:
Any SFC-aware network device should have the ability to perform
    performance measurement of the classifier component for each SFC.
     The performance can be quantified by measuring the performance metrics of
     the traffic from the classifier for each SFC/SFP.

Section 3.4. /
     3.5. Availability/PM of the underlay and overlay network: It would be good to
     add a sentence stating that the availability/PM mechanisms offered by the
     technologies used by the overlay/underlay are used, rather than defining
     new methods specifically for SFC.

<Nagendra>Yes, that makes sense. Please check the below text:

OLD:
Any SFC-aware network device may have the ability to perform
    availability check or performance measurement of the overlay network.

NEW:
Any SFC-aware network device may have the ability to perform
    availability check or performance measurement of the overlay network. Any
    existing OAM tools and techniques can be leveraged for this purpose.

Section 4. SFC OAM
     Functions: It would be good if examples in section 4 could also include more
     "recent" methods such as OWAMP/TWAMP (RFC 4656, RFC 5357).

<Nagendra>

OLD:
Delay within an SFC could be measured based on the time it takes for
    a packet to traverse the SFC from the ingress SFC node to the egress
    SFF.  As SFCs are unidirectional in nature, measurement of one-way
    delay [RFC7679] is important.  In order to measure one-way delay,
    time synchronization MUST be supported by means such as NTP, PTP,
    GPS, etc.

NEW:
Delay within an SFC could be measured based on the time it takes for
    a packet to traverse the SFC from the ingress SFC node to the egress
    SFF.  Measurement protocols such as the One-way Active Measurement
    Protocol (OWAMP) [RFC4656] and the Two-way Active Measurement Protocol
    (TWAMP) [RFC5357] can be used to measure these characteristics.  As
    SFCs are unidirectional in nature, measurement of one-way
    delay [RFC7679] is important.  In order to measure one-way delay,
    time synchronization MUST be supported by means such as NTP, Precision
    Time Protocol (PTP), GPS, etc.
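The one-way delay arithmetic behind this text can be sketched as below. It is a minimal illustration, assuming that the ingress SFC node and the egress SFF each timestamp the same packet using clocks synchronized by some external means (NTP, PTP, GPS); it does not implement OWAMP or TWAMP, and the record and function names are hypothetical. It also shows why synchronization matters: an offset between the two clocks shifts every measured delay by the same amount.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DelaySample:
    # Hypothetical per-packet record; timestamps in integer microseconds.
    tx_ingress_us: int  # when the packet left the ingress SFC node
    rx_egress_us: int   # when the packet arrived at the egress SFF


def one_way_delays(samples: List[DelaySample], clock_offset_us: int = 0) -> List[int]:
    """One-way delay of each packet, in microseconds.

    clock_offset_us is the known offset of the egress clock relative to the
    ingress clock (egress minus ingress). With perfectly synchronized clocks
    it is 0; any uncorrected offset biases every result by the same amount.
    """
    return [s.rx_egress_us - s.tx_ingress_us - clock_offset_us for s in samples]


if __name__ == "__main__":
    samples = [DelaySample(1_000, 5_000), DelaySample(2_000, 8_000)]
    print(one_way_delays(samples))                       # [4000, 6000]
    print(one_way_delays(samples, clock_offset_us=500))  # [3500, 5500]
```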

Section 4.4.
     Performance Measurement: Focus is entirely on the PM of the connectivity,
     rather than on the SF. How about covering PM for the SF as well?

<Nagendra> I am not sure I understand what is missing. Do you have any
suggestions for improving the text?

...FB: See above. This would be about adding a capability to assess how well the washing machine washes my laundry.

Section 5.1
     OAM Tool Gap Analysis:
      - Not sure what "NVO3 OAM" refers to. Could that be explained below the
      table and in section 1.2.1?

<Nagendra> Combining this with the other queries below, as they appear to be
related.

- E-OAM needs to be detailed. It seems that CFM
      (802.1ag) and not 802.3ah is referred to here.

<Nagendra> Per my understanding, 802.3ah is 1-hop while 802.1ag can span more
than 1 hop, and both use Ethernet frames. So I think both are applicable here.
My response regarding E-OAM details in this section is combined below.

...FB: Maybe I missed it - but I don't see text that refers to CFM or EFM OAM. Where is this covered? IMHO we would need references to IEEE standards to avoid confusion.

- "Trace" in the "Trace" column
      needs to be elaborated. Is this traceroute? Paris-Traceroute? IOAM Loopback?

      IPPM needs to be detailed, because IPPM is not a tool as such but an IETF
      WG. Does this refer to OWAMP/TWAMP/etc. as defined by IPPM?

<Nagendra> Combining the above queries.

OLD:
There are various OAM tool sets available to perform OAM functions
    within various layers.  These OAM functions may be used to validate
    some of the underlay and overlay networks.  Tools like ping and trace
    are in existence to perform connectivity check and tracing of
    intermediate hops in a network.  These tools support different
    network types like IP, MPLS, TRILL, etc.  There is also an effort to
    extend the tool set to provide connectivity and continuity checks
    within overlay networks.  BFD is another tool which helps in
    detecting data forwarding failures.  Table 3 below is not exhaustive

NEW:
There are various OAM tool sets available to perform OAM functions
    within various layers.  These OAM functions may be used to validate
    some of the underlay and overlay networks.  Tools like ping and trace
    are used to perform connectivity checks and tracing of
    intermediate hops in a network.  These tools are already available for
    different types of networks such as IP, MPLS, TRILL, etc.

E-OAM offers OAM mechanisms such as an Ethernet continuity check for
Ethernet links. There is an effort around NVO3 OAM to provide connectivity and
continuity checks for networks that use NVO3.  BFD is used for the detection of
data plane forwarding failures.

...FB: Check whether the NVO3 WG will indeed deliver a solution and whether "NVO3 OAM" indeed exists. If in doubt, it might be better to avoid forward-looking references. Per my note above, it would be good to explicitly refer to IEEE standards as opposed to introducing a new term like "E-OAM".

The IPPM framework [RFC2330] offers tools such as OWAMP [RFC4656] and
TWAMP [RFC5357] (collectively referred to as IPPM in this section) to measure
various performance metrics. MPLS Packet Loss Measurement (LM) and Packet
Delay Measurement (DM) [RFC6374] (collectively referred to as MPLS_PM in this
section) offer the ability to measure performance metrics in MPLS networks.

Table 3 below is not exhaustive.

Section 6.4.3 IOAM:
     - The section states that IOAM "may be used to perform various SFC OAM
     functions as well". It would be good to expand on this statement: E.g. IOAM
     Trace-Option Type could be leveraged for SFC tracing. IOAM Direct-Export
     Option Type could be leveraged. - How would we deal with the IOAM Active Flag
     (draft-ietf-ippm-ioam-flags-01) when used with SFC OAM?

<Nagendra> The intention of the section is to highlight the applicability of
different OAM toolsets for OAM functions at the service layer. I am not sure we
should really try to explain all the possible options within each tool. But I
agree that it is worth clarifying the availability of IOAM options for tracing.
I think we can clarify that different IOAM Option-Types are available for OAM
functions such as SFC tracing. Can you check if the text below looks OK?

OLD:
[I-D.ietf-sfc-ioam-nsh] defines how In-Situ OAM data fields are
    transported using NSH header.  [I-D.ietf-sfc-proof-of-transit]
    defines a mechanism to perform proof of transit to securely verify if
    a packet traversed the relevant SFP or SFC.  While the mechanism is
    defined inband (i.e., it will be included in data packets), it may be
    used to perform various SFC OAM functions as well.

NEW:
[I-D.ietf-sfc-ioam-nsh] defines how In-Situ OAM data fields are
    transported using NSH header.  [I-D.ietf-sfc-proof-of-transit]
    defines a mechanism to perform proof of transit to securely verify if
    a packet traversed the relevant SFP or SFC.  While the mechanism is
    defined inband (i.e., it will be included in data packets), IOAM Option-Types
    such as IOAM Trace Option-Types can also be used to perform other SFC OAM
    functions such as SFC tracing.

- The text states
     "In-Situ OAM could be used with O bit set": Why would IOAM be used with the
     overflow bit set for SFC OAM? For details on IOAM's O-bit, see section 4.4.1 in
     https://tools.ietf.org/html/draft-ietf-ippm-ioam-data-09.

<Nagendra> The O bit referred to here is not the O bit in IOAM but the one in
the NSH/overlay header. To avoid any confusion, this can be updated as below:

OLD:
In-Situ OAM could be used with O bit set to perform SF availability
    and SFC availability or performance measurement.

NEW:
In-Situ OAM could be used with the O bit in the overlay header set to perform SF
    availability and SFC availability or performance measurement.

... FB: Ah, ok. Given that this section is about IOAM and not NSH, I'd rather explicitly refer to NSH here. E.g., if SFC is realized using NSH, then the O-bit in the NSH header could be used to indicate OAM traffic. You could refer to https://tools.ietf.org/html/draft-ietf-sfc-ioam-nsh-03#section-4.2 explicitly.
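Since the O bit in question is the one in the NSH header, the following minimal sketch shows how a node might read it (together with the TTL) from the first 32-bit word of the NSH Base Header, using the field layout defined in RFC 8300. It only decodes that word: it does not validate the version, handle metadata, or touch IOAM data fields, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NshBaseHeader:
    version: int
    oam: bool           # O bit: set for OAM packets
    ttl: int            # 6-bit TTL
    length: int         # total NSH length, in 4-byte words
    md_type: int        # metadata type
    next_protocol: int


def parse_nsh_base_header(data: bytes) -> NshBaseHeader:
    """Decode the 4-byte NSH Base Header (field layout per RFC 8300).

    Bit layout, most significant bit first:
    Ver(2) O(1) U(1) TTL(6) Length(6) U(4) MD Type(4) Next Protocol(8)
    """
    if len(data) < 4:
        raise ValueError("NSH Base Header needs at least 4 bytes")
    word = int.from_bytes(data[:4], "big")
    return NshBaseHeader(
        version=(word >> 30) & 0x3,
        oam=bool((word >> 29) & 0x1),
        ttl=(word >> 22) & 0x3F,
        length=(word >> 16) & 0x3F,
        md_type=(word >> 8) & 0x0F,
        next_protocol=word & 0xFF,
    )


def is_oam_packet(data: bytes) -> bool:
    """True if the O bit is set in the NSH Base Header."""
    return parse_nsh_base_header(data).oam


if __name__ == "__main__":
    # O bit set, TTL 63, Length 6 words, MD Type 1, Next Protocol 0x03
    print(parse_nsh_base_header(bytes.fromhex("2fc60103")))
    print(is_oam_packet(bytes.fromhex("0fc60103")))  # False: O bit clear
```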

Section 6.4.4 SFC
     Traceroute: - This section refers to an expired draft (even calling out the
     fact that the draft has expired), but also mentions that functionality is
     available and implemented in OpenDaylight. Consider removing the references
     to the expired draft and adding references to OpenDaylight documents instead.
     - IOAM Loopback (see draft-ietf-ippm-ioam-flags-01) could apply to SFC
     Traceroute as well.

<Nagendra>Ok. Let me check if I can find some reference for ODL.

     Detailed set of nits that I encountered while reading through the document
     ([x] references line number x) - hope that they are helpful in further
     improving the doc:

<Nagendra> Yes, of course.

     [global] s/an SF/a SF/ -- and similarly SFC/SFF

<Nagendra> Other RFCs use "an SF/SFF", so the draft has been updated accordingly.
If your suggestion is to replace "a SF" with "an SF", it is done.

     [176] "OAM Controller" not defined

<Nagendra>We can change it as below:

OLD:
OAM controllers are assumed to be within the same administrative
    domain as the target SFC enabled domain.

NEW:
OAM controllers are SFC-aware network devices that are capable of generating
OAM packets. They are assumed to be within the same administrative domain as
the target SFC enabled domain.

     [202] Why just Virtual Machines and no containers? Suggest to make things
     generic and talk about virtual and physical entities.
           This comment applies throughout the document.

<Nagendra> We changed this to "virtual entities".
     [216] Ethernet OAM: Add reference. Do you refer to physical layer Ethernet
     OAM (802.3ah) or CFM (802.1ag)?

<Nagendra> The response was provided in the above comment section.

[243] s/uses the overlay network/uses the overlay
     network layer/

<Nagendra> Done.

[246] Could we add a few examples of "various overlay network
     technologies"? For the underlay network layer several examples are listed.

<Nagendra> Ok.

     [248] What does "mostly transparent" mean?

<Nagendra> The data plane elements connecting the overlay layer nodes may
not always process the overlay header.

...FB: How about we explain this in the document?

[254] What does "tight coupling"
     between the link layer and the physical technology mean?

<Nagendra>I am not sure I understand the nit here. Do you see any difficulty in
parsing the sentence?

...FB: Not sure what "tight coupling" means here. Could you clarify what is "tight coupling" vs. "not tight coupling"?

[255] Suggest to avoid
     terms like "popular" - popularity can change, standards stay

<Nagendra> Ok. This is changed to "Ethernet is one such choice...".

[256] Acronyms
     "POS" and "DWDM" are not defined

<Nagendra> Added.

[274] Link start/end-points don't seem to
     always align with the underlay network in the diagram

<Nagendra> Fixed it.

[287] s/may comprise
     of/may consist of/

<Nagendra> We changed it to "may comprise".

[288] s/but not shown/but is not shown/

<Nagendra> We changed this to "intermediate nodes not shown...".

[307]
     s/devices/device/

<Nagendra> Done.

[308] What is a "controller"?

<Nagendra> We discussed this in the above comment section.

[314] s/includes/include/

<Nagendra>Done.

[319]
     Add hSFC to list of acronyms in section 1.2.1

<Nagendra> This is expanded in the respective section. We added it in the
acronym section as well.

[320] Add IBN to list of acronyms
     in section 1.2.1

<Nagendra> Ok, Done.

[325] s/includes/include/

<Nagendra> Done.

[359] The function/term "controller"
     requires definition.

<Nagendra> Done, as mentioned in the above comment section.

[383] s/?./?/

[398] s/get the got/got/

<Nagendra> Done.

  [461]
     s/devices/device/

<Nagendra> Done.

  [469] Does it have to be equal cost multipath at the service
     layer, or could unequal cost multipath also be an option for load-balancing?

<Nagendra>I didn’t see any discussion specific to ECMP/UCMP in the
architecture RFC.

...FB: Hmm. I did not see that RFC7665 is only about equal cost multipath.

  [521] Not sure whether the overlay network establishes the service plane. Isn't
     it that the overlay network establishes connectivity for the SFC-related
     functions in the service plane?

<Nagendra> The service layer is established over the overlay network layer. I am
not sure it is right to say that the overlay network provides connectivity for
the service layer.

...FB: The overlay network is one component of the service layer, isn't it? So it is required but not sufficient.

[531] s/components/component/

[545] remove "underlay"

<Nagendra> Done.

[595] s/devices/device/

<Nagendra> Done.

[600] s/action/an action/

<Nagendra> Done.

[601] Expand on
     "TTL or other means" (TTL also needs to be added to acronyms in 1.2.1). Is this
     specific to NSH? Or specific to IPv4?

<Nagendra> TTL is listed as a well-known abbreviation in
https://www.rfc-editor.org/materials/abbrev.expansion.txt, so we left it as is.
TTL in this document refers to the NSH TTL field.

...FB: Let's ensure we refer to NSH TTL in this case. Given that SFC can be done with other means than NSH, implicit reference to NSH might be a problem.
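To illustrate the kind of TTL-based tracing referred to here ("TTL or other means"), the following is a minimal, self-contained simulation. It assumes a simplified model in which every hop on the path decrements the TTL by one and the hop where the TTL reaches zero answers the probe; the exact NSH TTL processing rules (RFC 8300) and the reply mechanism are left out, and the hop names and functions are hypothetical. The default limit of 63 reflects the 6-bit NSH TTL field.

```python
from typing import List, Optional


def forward_probe(path: List[str], ttl: int) -> Optional[str]:
    """Return the hop at which a probe with the given TTL expires.

    Simplified model: each hop decrements the TTL by one, and the hop where
    it reaches zero answers the probe. Returns None if the probe passes the
    last hop without expiring.
    """
    for hop in path:
        ttl -= 1
        if ttl == 0:
            return hop
    return None


def trace(path: List[str], max_ttl: int = 63) -> List[str]:
    """Discover the hops in order by sending probes with TTL = 1, 2, 3, ..."""
    hops: List[str] = []
    for ttl in range(1, max_ttl + 1):
        hop = forward_probe(path, ttl)
        if hop is None:
            break  # the probe got past the last hop: trace complete
        hops.append(hop)
    return hops


if __name__ == "__main__":
    sfp = ["SFF1", "SF-A", "SFF2", "SF-B"]  # hypothetical service function path
    print(trace(sfp))
    print(trace(sfp, max_ttl=2))  # ['SFF1', 'SF-A']
```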

  [630] Mention that for "approximation of
     packet loss for a given SFC can be derived" to be applicable, SFC OAM packets
     would need to be forwarded the same as live user traffic.

<Nagendra> As the intent is to derive an approximate loss value, I am not sure
we need the additional consideration that the OAM packet would need to follow
the live user traffic. Let me know if you think otherwise.

...FB: IMHO we should - given that it is one potential complication.
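The loss approximation discussed in this exchange can be sketched as a simple comparison of probe counters. This is a minimal example assuming hypothetical per-interval counters of OAM probes sent at the ingress SFC node and received at the egress SFF; the names are illustrative only. As the reviewer points out, the result approximates the loss experienced by user traffic only if the probes are forwarded the same way as live user traffic.

```python
from dataclasses import dataclass


@dataclass
class ProbeCounters:
    # Hypothetical counters for one SFC/SFP over one measurement interval.
    sent_at_ingress: int     # OAM probes injected at the ingress SFC node
    received_at_egress: int  # the same probes observed at the egress SFF


def approx_loss_ratio(c: ProbeCounters) -> float:
    """Fraction of OAM probes lost between ingress and egress.

    Approximates the SFC's packet loss only if the probes are forwarded and
    processed the same way as live user traffic.
    """
    if c.sent_at_ingress <= 0:
        raise ValueError("no probes sent in this interval")
    if not 0 <= c.received_at_egress <= c.sent_at_ingress:
        raise ValueError("received count must be between 0 and the sent count")
    return (c.sent_at_ingress - c.received_at_egress) / c.sent_at_ingress


if __name__ == "__main__":
    print(approx_loss_ratio(ProbeCounters(sent_at_ingress=1000, received_at_egress=990)))
```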

  [636] Is uppercase
     "MUST" applicable to an informational document? Especially given that
     RFC2119/RFC8174 is explicitly referenced by the draft.

<Nagendra> Based on various reviewer comments, we removed the use of any
normative statement.

[666] Add MPLS, TRILL to
     acronyms in 1.2.1

<Nagendra> Ok. Done.

[678] s/exhaustive/exhaustive./

<Nagendra> Done.

[720] Is uppercase "SHOULD" applicable to an informational document?
     Especially given that RFC2119/RFC8174 is explicitly referenced by the draft.

<Nagendra> Based on various reviewer comments, we removed the use of any
normative statement.

[722] Is uppercase "MAY" applicable to an informational document? Especially
     given that RFC2119/RFC8174 is explicitly referenced by the draft.

<Nagendra> Based on various reviewer comments, we removed the use of any
normative statement.

[754]
     s/packet/packets/

[755] s/to next node/to the next node/

  [771] How does this
     requirement align with the earlier paragraph, e.g. in case a node sends an
     ICMP reply? It would probably make sense to scope the statement to e.g. NSH.

<Nagendra> As mentioned in the statement, the node that initiates the OAM
packet must set the marker, so this statement applies to the initiating
node.

[806]
     s/function/functions/

<Nagendra> Done

[809] s/from relevant node/from the relevant node/

<Nagendra> Done

[810]
     s/generate ICMP/generate an ICMP/

<Nagendra> Done

[812] s/from last/from the last/

<Nagendra> Done

[830]
     s/perform continuity/perform the continuity/

<Nagendra> Done

  [834] s/with relevant/with the
     relevant

<Nagendra> Done

[835] s/perform partial SFC availability./perform a partial SFC
     availability check./

<Nagendra> Done

[851] For "In-Situ OAM data fields" add a normative
     reference to draft-ietf-ippm-ioam-data

[905] Add "CLI" to section 1.2.1
     acronyms

<Nagendra> Done

[920] Add a reference for NETCONF ->RFC6241

<Nagendra> Done

Once again, thanks a lot for the great comments.

Regards,
Nagendra

Thanks again for considering the comments in great detail. Much appreciated.

Cheers, Frank

--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call