[Last-Call] Re: [bess] Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09

Luc André Burdet <laburdet.ietf@xxxxxxxxx> · Mon, 19 Aug 2024 20:24:05 +0000

Hi Toerless,

Thank you for the detailed review. I have updated the inline-comments for -10 which will be posted soon. For the itemized questions please see below. Thanks !

G.1 and G.2: I will leave that question for a wider scope; this document merely updates existing RFCs, and the reference to HRW is ‘en passant’ as an improvement which happened over time (perfect or not...).

G.3 is a very interesting proposal actually, for orderly ‘removal’ from the network (maintenance operations). I will give this some more thought with co-authors to see how to incorporate this, thanks for the valuable suggestion!

G.4: please note this draft currently addresses “controlled recovery” only, not “controlled failures” (as in G.3). While the concern is technically accurate, in reality interface recoveries are very rarely in the “same millisecond” or close thereto.
In practice, even if unlatched all together, recovering interfaces will also have some time gaps between them. One way to address this concern is to provide for a non-default (configured) skew to account for hardware programming speed(s). More pertinent, though, is that this draft allows for larger, non-default peering timer values (rather than the 3s default from the base RFC), and interfaces that are known to be slow to program, or that have a large number of subinterfaces or hosts to program, can easily avail of a larger peering timer specific to the conditions of that ES. The SCT represents the wall-clock time at which this base-RFC peering timer expires at the recovering PE.

G.5/G.6: variant (a) is the one I am aware of as implemented by vendors: wait for NTP sync before proceeding with many or most operations in the control plane, including this peering of Ethernet Segments. If NTP sync becomes an issue (on router first-reload, for example), delays are usually added prior to inserting the router into the network (advertising routes). In short, NTP sync often becomes a gate to some operations -> I could add some text with a stronger link to clock-sync before including the SCT extended community?

G.7: this was always poorly written. I have updated it to “subtracting a positive value” throughout, but the “break before make” is actually on purpose. On recovery you do not want 2 interfaces in DF mode; that will create duplicates, loops, etc.

Regards,
Luc André

Luc André Burdet |  Cisco  |  laburdet.ietf@xxxxxxxxx  |  Tel: +1 613 254 4814

From:
Toerless Eckert via Datatracker <noreply@xxxxxxxx>

Date: Wednesday, August 14, 2024 at 12:27

To: iot-directorate@xxxxxxxx <iot-directorate@xxxxxxxx>

Cc: bess@xxxxxxxx <bess@xxxxxxxx>, draft-ietf-bess-evpn-fast-df-recovery.all@xxxxxxxx <draft-ietf-bess-evpn-fast-df-recovery.all@xxxxxxxx>, last-call@xxxxxxxx <last-call@xxxxxxxx>, evyncke@xxxxxxxxx <evyncke@xxxxxxxxx>

Subject: [bess] Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09

Reviewer: Toerless Eckert

Review result: On the Right Track

Summary:

The purpose of the document is to extend the BGP message signaling and local

router procedures for failover of "Designated Forwarders" for pseudowires using

calculated future timestamps and expecting clock synchronization across the

forwarders, so that after receipt of the BGP message, the switchover can be

handled autonomously by every node as synchronously as desired and allowed for

by the clock synchronization method used.

Review result: On The Right Track

I am the assigned IOTDIR reviewer.  I found the document well written and easy

to read, except for some typos, other nits and some logical description gaps.

(unfortunately ?) I find the approach of the draft very useful, and i always

wished we would have been able to build this in other IETF protocol domains (IP

multicast), so i happen to have a range of technical concerns and suggestions

primarily around the completeness of the document's methods and detail

specifications, which i hope will be helpful to improve on the quality of the

text and usefulness of the solution.

The following is a list of G.i general comments followed by the commented

idnits version of the draft.

Thank you very much for the work!

    Toerless Eckert

General comments:

G.1 minor: Why IOTdir review ?

I am a bit puzzled why this draft was given to IOTdir for early review. Neither

the draft nor the RFCs it references mentions IoT. And the mentioned pseudowire

use-cases are all around DataCenter. So i wonder what specific IoT feedback the

authors/WG is looking for. If there actually is a specific type of use-cases for

IoT with this technology, then it would be great to mention.

G.2 minor/suggestion: HRW has known problems

HRW was popularized and (in)validated in deployments of PIM-SM since 1995 and

hence rfc2362 way before HRW1998 was written, but of course not credited in

RFC8485. I would nevertheless like to point out that the IP Multicast community

in the IETF had some run-ins with operators over the decades who were

disappointed by its non-equal distribution in face of specific typical set of

parameters such as consecutive or close to each other router-IDs. Of course,

the parameters used in EVPN are different, and i have not tried to validate if

or how such deployment specific anomalies would or could equally apply to the

EVPN version, but i would strongly suggest to be aware that HRW is by far not a

well randomizing algorithm especially for the order of the input parameters.

HRW is now probably 30 years old, and maybe EVPN may want to look into newer,

and supposedly better algorithms such as MurmurHash (which was a recommendation

from a math geek colleague even 15 years ago - and other proposals in the IETF

are picking up on it too).

G.3 minor/question: Please consider adding ordered shutdown support

If my understanding of RFC7432/RFC8584 and this draft is correct, the

interruption in case of an ordered shutdown of a DF is as large as that of an

unexpected shutdown/service interruption (without the detection of interruption

of course). I think this is not necessary.

I think it would be great if this draft could add support for the synchronized

switchover in case of ordered shutdown of a DF because such procedures

constitute likely a large number of outages in daily operations of larger

networks.

For example, the new extended community could have a flag indication of such an

ordered shutdown so that the indicated SCT will trigger synchronized failover

to the BDF (Backup DF). And only after the failover has happened would the

primary DF send out the NLRI withdraw route and finish the shutdown operation.

G.4 major: analysis of actual failover behavior

The mechanism of this draft seems to aspire through synchronized switchover to

achieve a switchover interruption  in the order of 10 msec (the skew default

value). I am worried that in the face of a large number of failovers (because

of a large number of VLAN/ES services), that the interruption becomes larger

and that it will be inconsistent across different services.

The way i imagine the failover to operate (from similar failovers in other

technologies like multicast), a router may fairly quickly be able to generate

the SCT carrying routes, so there can be a burst of SCT routes all with the

same SCT. When those SCT then actually expire both on the sending and receiving

router, the speed at which they are added/deleted in hardware-forwarding will

depend on the performance of updating hardware forwarding registers. Which may

be inconsistent across different routers. It is also not clear to me if the BGP

infrastructure or other factors can or can not introduce any reordering. But if

for example we have thousand routes that need to be updated, and one router can

update 1000 routes/sec and the other can update 2000 routes/sec, then one will

be done after half a second, the other after one second - no reordering assumed.

So it would be very helpful to have some idea about the maximum imaginable

scalability required and likely min/max performances to vet the impact of this

candidate issue.
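
The arithmetic of the example above can be sketched as follows. This is only an illustration: the function name and the assumption of a constant hardware programming rate are hypothetical simplifications, not anything taken from the draft or from a real router.

```python
def programming_time_s(num_routes: int, routes_per_sec: float) -> float:
    """Seconds needed to program num_routes, assuming a constant rate."""
    return num_routes / routes_per_sec


routes = 1000
fast = programming_time_s(routes, 2000)  # router updating 2000 routes/sec
slow = programming_time_s(routes, 1000)  # router updating 1000 routes/sec

# The last route on the slower router is programmed (slow - fast) seconds
# after the last route on the faster router, even with perfect clock sync.
print(f"fast: {fast:.1f}s, slow: {slow:.1f}s, gap: {slow - fast:.1f}s")
```

The gap grows linearly with the number of routes, so it can exceed a 10 msec skew by orders of magnitude.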

There is of course a way to overcome this issue, which is to generate SCT that

take the performance of (de)installation of hardware forwarding entries into

account, for example by assuming some floor performance and generate SCT for

such burst of service routes with timestamps increasing such that when they

will be executed, they will stay under such a performance floor. Aka: Have a

difference of e.g.: 4msec between each route, in result creating no more than

250 SCT updates/second.
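
A minimal sketch of such staggering, assuming a fixed 4 msec spacing per route. The helper name `staggered_scts` and the use of exact fractions for timestamps are illustrative choices, not part of the draft.

```python
from fractions import Fraction


def staggered_scts(base_sct: Fraction, num_routes: int, spacing_ms: int = 4):
    """Return one SCT per route, each spacing_ms later than the previous one."""
    step = Fraction(spacing_ms, 1000)
    return [base_sct + i * step for i in range(num_routes)]


scts = staggered_scts(Fraction(100), 1000)  # base SCT of 100 s is arbitrary
spread = scts[-1] - scts[0]                 # time between first and last SCT
rate = 1 / Fraction(4, 1000)                # resulting updates per second

print(f"spread: {float(spread)} s, rate: {int(rate)} updates/second")
```

With 1000 routes the last SCT lies almost 4 seconds after the first, which is the price paid for staying under the assumed performance floor.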

In any case, it would be great if the target goal of this draft - less

than 10 msec interruption would not be invalidated by such real-world

performance impacts if it actually is easy to overcome it with a bit of

additional text in the draft.

G.5 major: Behavior upon non-synchronization.

I think the draft should do more due-diligence in its text for various

conditions of non-correct time synchronization between devices. Let us first agree

on the conditions and general direction, and then i am happy to propose text if

it makes sense to the WG.

a) A router can and then should validate the state of synchronization of its

clock (in NTP for example this is typically possible via some management API,

not sure if there is already a YANG model). When restarting, the router may find that its clock

is not synchronized to a necessary degree of accuracy yet. Minimum required

synchronization accuracy should be configurable, default maybe 3 msec. In this

case the router would wait until the synchronization is sufficient up to a

maximum time period (configurable, default maybe 30 seconds). If

synchronization is not sufficient then, revert to behave as non-draft compliant

router - and upgrade later on if and when synchronization is successful.
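
The gating logic of suggestion a) could look roughly like the sketch below. Everything here is an assumption for illustration: `get_offset_s` stands in for whatever management API reports the estimated clock offset (returning None while unsynchronized), and the defaults simply mirror the 3 msec and 30 second values suggested above.

```python
import time
from typing import Callable, Optional


def wait_for_clock_sync(
    get_offset_s: Callable[[], Optional[float]],  # hypothetical offset query
    required_accuracy_s: float = 0.003,           # suggested default: 3 msec
    max_wait_s: float = 30.0,                     # suggested default: 30 s
    poll_interval_s: float = 0.5,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the clock is accurate enough, False on timeout.

    On False the caller would fall back to the non-SCT behavior and may
    retry later, once synchronization has succeeded.
    """
    deadline = now() + max_wait_s
    while True:
        offset = get_offset_s()
        if offset is not None and abs(offset) <= required_accuracy_s:
            return True
        if now() >= deadline:
            return False
        sleep(poll_interval_s)
```

The `now` and `sleep` parameters exist only so that the loop can be exercised with a fake clock.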

b) A router which is aware that it is correctly synchronized is receiving an

SCT update from another router which did not correctly recognize its own

synchronization failure (e.g.: does not have the API to validate its local

clock being synchronized).  This condition might warrant a flag bit in the

route updates, if feasible.

To discover and work around this condition, routers will perform plausibility

check on received SCT timestamps, e.g.: validate that the received timestamp is

within a reasonable window around the local (synchronized) clock at the time of

reception of the SCT carrying route: at least one second from current clock, at

most the configured interval (default 3 seconds), plus extensions, such as some

seconds if concern G.4 is taken into account. If the received SCT is out of

bounds, then the receiving router would raise some error condition and perform

some fallback failover, e.g.: within 3 seconds from reception (to avoid that

failover would happen at an inappropriately long time in the future, or

immediately, when SCT is in the past).
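
A sketch of the plausibility check described in b). The function names, the string results and the choice of "now + 3 seconds" as the fallback time are illustrative assumptions; the bounds follow the values suggested above (at least 1 second ahead, at most the 3 second default plus any extension).

```python
def classify_sct(
    sct_s: float,
    now_s: float,
    min_lead_s: float = 1.0,       # "at least one second from current clock"
    peering_timer_s: float = 3.0,  # configured interval, default 3 seconds
    extension_s: float = 0.0,      # extra allowance, e.g. for concern G.4
) -> str:
    """Classify a received SCT relative to the local synchronized clock."""
    lead = sct_s - now_s
    if lead < min_lead_s:
        return "too-early"  # includes an SCT that is already in the past
    if lead > peering_timer_s + extension_s:
        return "too-late"
    return "ok"


def effective_carving_time(sct_s: float, now_s: float, fallback_s: float = 3.0) -> float:
    """Use the SCT if plausible, else fall back to now + fallback_s (assumed)."""
    if classify_sct(sct_s, now_s) == "ok":
        return sct_s
    # A real implementation would also raise an error condition here.
    return now_s + fallback_s


print(classify_sct(102.5, 100.0), classify_sct(99.0, 100.0), classify_sct(110.0, 100.0))
```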

G.6 minor: some suggested NTP operational text

The following is proposed text for some NTP clock synchronization operational

considerations sections including only G.5 suggestion a). But also other

aspects crucial for successful deployment.

----

While the use of a synchronized clock between the participating routers makes

the solution itself very simple and accurate, it does introduce a new

potentially large and complex dependency against the clock synchronization

mechanism used. Because of the use of NTP timestamps, it is not possible to

build really lightweight and autonomously operating clock synchronization

systems. Instead, one will likely need to create an operational dependency

against a clock source with automated inclusion of complexities specifically

the leap seconds, which includes satellite clock sources (Beidou, Galileo,

GLONASS or GPS), or terrestrial (DCF77, WWVB, MSF or JJY). If this dependency

is operationally already established for other purposes, then the mechanism of

this document does not provide incremental requirements except maybe for the

required accuracy. Otherwise the requirements to operate the clock

synchronization need to be analyzed.

For the mechanism of this document to provide the desired benefit,

synchronization of a few milliseconds (5) or less is required, so that the skew

is sufficient to separate the break DF times from the make DF times. This

should in general not be a problem to achieve with minimal NTPv4 installations

that are aware of common pitfalls as follows.

When a router restarts, initial synchronization to other NTP server(s) is sped

up if the router has a local battery backed RTC clock from which it can derive

a starting time as well as the capability to step the clock to quickly

synchronize to the other NTP server(s).

If either is not possible, synchronization may take more than a few seconds

after reboot and it may be desirable to delay bringing up DF functionality

until the desired accuracy of clock synchronization is achieved.

Synchronization across WAN links can be subject to asymmetric latency, which

can be as high as some msec, such as for pseudowires across transcontinental

connectivity between backup DCs. Clock synchronization protocols can not

automatically figure out such asymmetric propagation latencies. If deployments

with such asymmetric latencies are required, the clock synchronization protocol

needs to have options to learn about such asymmetries, such as through

configuration.
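
To illustrate why asymmetry matters: the standard NTP offset estimate from the four exchange timestamps assumes equal delay in both directions, so half of any asymmetry shows up as clock error. The simulation below is a sketch with hypothetical delay values in milliseconds and zero server processing time.

```python
def ntp_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """NTP offset estimate: ((T2 - T1) + (T3 - T4)) / 2."""
    return ((t2 - t1) + (t3 - t4)) / 2


def estimated_offset_ms(true_offset_ms: int, fwd_delay_ms: int, rev_delay_ms: int) -> float:
    """Simulate one client/server exchange and return the estimated offset."""
    t1 = 0                                   # client transmit (client clock)
    t2 = t1 + fwd_delay_ms + true_offset_ms  # server receive (server clock)
    t3 = t2                                  # server transmit, no processing time
    t4 = t3 - true_offset_ms + rev_delay_ms  # client receive (client clock)
    return ntp_offset(t1, t2, t3, t4)


# Clocks are actually in sync, but the path is 40 ms one way and 36 ms back:
print(estimated_offset_ms(0, 40, 36))  # estimate is off by (40 - 36) / 2 ms
```

An asymmetry of 4 msec therefore consumes 2 msec of the few-millisecond budget discussed above.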

G.7 minor: make before break instead of break before make

I think that it would make sense to define skew as configurable and explicitly

point to the option of making it positive so as to achieve "make before break"

functionality, E.g.: making the recovering router become DF slightly before the

withdrawing router.

I can think of several types of customer services that can better deal with

duplicates than with even short term losses. And unless i am overlooking some

looping issues in the broadcast domains (which i likely may), the only reason

to do break before make is IMHO services where the simultaneous sending will

result in overload. But whenever a service has a low rate of actual user

traffic, most applications will prefer a few duplicates over a few lost packets.

--

The following is idnits output to have line numbers. issues/discussions from

the review have no line numbers.

------

draft-ietf-bess-evpn-fast-df-recovery-09.txt:

  Showing Errors (**), Flaws (~~), Warnings (==), and Comments (--).

  Errors MUST be fixed before draft submission.  Flaws SHOULD be fixed before

  draft submission.

  Checking boilerplate required by RFC 5378 and the IETF Trust (see

  https://trustee.ietf.org/license-info):

  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to 
https://www.ietf.org/id-info/1id-guidelines.txt:

  ----------------------------------------------------------------------------

     No issues found here.

  Running in submission checking mode -- *not* checking nits according to

  https://www.ietf.org/id-info/checklist .

  ----------------------------------------------------------------------------

     No nits found.

--------------------------------------------------------------------------------

2       BESS Working Group                                     P. Brissette, Ed.

3       Internet-Draft                                                A. Sajassi

4       Updates: 8584 (if approved)                                   LA. Burdet

5       Intended status: Standards Track                                   Cisco

6       Expires: 9 January 2025                                         J. Drake

7                                                                    Independent

8                                                                     J. Rabadan

9                                                                          Nokia

10                                                                   8 July 2024

12                Fast Recovery for EVPN Designated Forwarder Election

13                      draft-ietf-bess-evpn-fast-df-recovery-09

15      Abstract

17         The Ethernet Virtual Private Network (EVPN) solution provides

18         Designated Forwarder (DF) election procedures for multihomed Ethernet

19         Segments.  These procedures have been enhanced further by applying

20         Highest Random Weight (HRW) algorithm for Designated Forwarder

21         election in order to avoid unnecessary DF status changes upon a

22         failure.  This document improves these procedures by providing a fast

23         Designated Forwarder election upon recovery of the failed link or

24         node associated with the multihomed Ethernet Segment.  This document

25         updates Section 2.1 of [RFC8584] by optionally introducing delays

26         between some of the events therein.

28         The solution is independent of the number of EVPN Instances (EVIs)

29         associated with that Ethernet Segment and it is performed via a

30         simple signaling between the recovered node and each of the other

31         nodes in the multihoming group.

33      Status of This Memo

35         This Internet-Draft is submitted in full conformance with the

36         provisions of BCP 78 and BCP 79.

38         Internet-Drafts are working documents of the Internet Engineering

39         Task Force (IETF).  Note that other groups may also distribute

40         working documents as Internet-Drafts.  The list of current Internet-

41         Drafts is at https://datatracker.ietf.org/drafts/current/.

43         Internet-Drafts are draft documents valid for a maximum of six months

44         and may be updated, replaced, or obsoleted by other documents at any

45         time.  It is inappropriate to use Internet-Drafts as reference

46         material or to cite them other than as "work in progress."

48         This Internet-Draft will expire on 9 January 2025.

50      Copyright Notice

52         Copyright (c) 2024 IETF Trust and the persons identified as the

55         This document is subject to BCP 78 and the IETF Trust's Legal

56         Provisions Relating to IETF Documents (https://trustee.ietf.org/

57         license-info) in effect on the date of publication of this document.

58         Please review these documents carefully, as they describe your rights

59         and restrictions with respect to this document.  Code Components

60         extracted from this document must include Revised BSD License text as

61         described in Section 4.e of the Trust Legal Provisions and are

62         provided without warranty as described in the Revised BSD License.

64      Table of Contents

66         1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2

67           1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3

68           1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3

69           1.3.  Challenges with Existing Mechanism  . . . . . . . . . . .   3

70           1.4.  Design Principles for a Solution  . . . . . . . . . . . .   5

71         2.  DF Election Synchronization Solution  . . . . . . . . . . . .   5

72           2.1.  BGP Encoding  . . . . . . . . . . . . . . . . . . . . . .   6

73           2.2.  Updates to RFC8584  . . . . . . . . . . . . . . . . . . .   7

74         3.  Synchronization Scenarios . . . . . . . . . . . . . . . . . .   8

75           3.1.  Concurrent Recoveries . . . . . . . . . . . . . . . . . .  10

76         4.  Backwards Compatibility . . . . . . . . . . . . . . . . . . .  11

77         5.  Security Considerations . . . . . . . . . . . . . . . . . . .  11

78         6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12

79         7.  Normative References  . . . . . . . . . . . . . . . . . . . .  12

80         Appendix A.  Contributors . . . . . . . . . . . . . . . . . . . .  13

81         Appendix B.  Acknowledgements . . . . . . . . . . . . . . . . . .  13

82         Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  14

84      1.  Introduction

86         The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is

87         becoming pervasive in data center (DC) applications for Network

88         Virtualization Overlay (NVO) and DC interconnect (DCI) services, and

89         in service provider (SP) applications for next generation virtual

90         private LAN services.

nit: If there is any IoT use, please mention

nit: "pervasive" is a bold statement. I do not know enough to support or doubt

it, but if there was any reference you could add to support the claim, then it

would make it stronger. Else maybe tone it down ("widely used")...

92         [RFC7432] describes Designated Frowarder (DF) election procedures for

                                           ^ typo

93         multihomed Ethernet Segments.  These procedures are enhanced further

94         in [RFC8584] by applying the Highest Random Weight (HRW) algorithm

nit:

please add the HRW1998 reference as used in RFC8584 as reference for the

term HRW and include it here.

95         for DF election in order to avoid unnecessary DF status changes upon

96         a link or node failure associated with the multihomed Ethernet

97         Segment.  This document makes further improvements to the DF election

nit: insert paragraph break before "This" (background -> contribution).

98         procedures in [RFC8584] by providing an option for a fast DF election

99         upon recovery of the failed link or node associated with the

100        multihomed Ethernet Segment.  This DF election is achieved

101        independent of the number of EVPN Instances (EVIs) associated with

102        that Ethernet Segment and it is performed via straightforward

103        signaling between the recovered node and each of the other nodes in

104        the multihomed group.

105        This document updates the DF Election Finite State Machine (FSM)

106        described in Section 2.1 of [RFC8584], by optionally introducing

107        delays between some events, as further detailed in Section 2.2.  The

108        solution is based on a simple one-way signaling mechanism.

110     1.1.  Requirements Language

112        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",

113        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and

114        "OPTIONAL" in this document are to be interpreted as described in BCP

115        14 [RFC2119] [RFC8174] when, and only when, they appear in all

116        capitals, as shown here.

118     1.2.  Terminology

120        PE:  Provider Edge device.

122        Designated Forwarder (DF):  A PE that is currently forwarding

123           (encapsulating/decapsulating) traffic for a given VLAN in and out

124           of a site.

126        EVI:  An EVPN instance spanning the Provider Edge (PE) devices

127           participating in that EVPN.

129     1.3.  Challenges with Existing Mechanism

131        In EVPN technology, multiple Provider Edge (PE) devices have the

132        ability to encapsulate and decapsulate data belonging to the same

133        VLAN.  Under certain conditions, this may cause Layer2 duplicates and

134        potential loops if there is a momentary overlap in forwarding roles

135        between two or more PE devices, consequently leading to broadcast

136        storms.

138        EVPN [RFC7432] currently specifies timer-based synchronization among

139        PE devices within a redundancy group.  This approach can lead to

140        duplications and potential loops due to multiple Designated

141        Forwarders (DFs) if the timer interval is too short, or to packet

142        drops if the timer interval is too long.

144        Split-horizon filtering, as described in Section 8.3 of [RFC7432],

145        can prevent loops but does not address duplicates.  However, if there

146        are overlapping Designated Forwarders (DFs) of two different sites

147        simultaneously for the same VLAN, the site identifier will differ

148        when the packet re-enters the Ethernet Segment.  Consequently, the

149        split-horizon check will fail, resulting in Layer 2 loops.

minor:

i can not find a description of this setup and problem in [RFC7432],

and the description in the paragraph above is quite terse so that i am not

sure that i would make up from scratch a fitting example. I think it would

thus be useful to provide a topology with an appropriate example of this

condition and explain the problem based on that topology example.

151        The updated Designated Forwarder (DF) procedures outlined in

152        [RFC8584] use the well-known Highest Random Weight (HRW) algorithm to

153        prevent the reshuffling of VLANs among PE devices within the

154        redundancy group during failure or recovery events.  This approach

155        minimizes the impact on VLANs not assigned to the failed or recovered

156        ports and eliminates the occurrence of loops or duplicates during

157        such events.

159        However, upon PE insertion or a port being newly added to a

160        multihomed Ethernet Segment, HRW also cannot help as a transfer of DF

161        role to the new port must occur while the old DF is still active.

163                                          +---------+

164                       +-------------+    |         |

165                       |             |    |         |

166                     / |    PE1      |----|         |   +-------------+

167                    /  |             |    |  MPLS/  |   |             |---CE3

168                   /   +-------------+    |  VxLAN/ |   |     PE3     |

169              CE1 -                       |  Cloud  |   |             |

170                   \   +-------------+    |         |---|             |

171                    \  |             |    |         |   +-------------+

172                     \ |     PE2     |----|         |

173                       |             |    |         |

174                       +-------------+    |         |

175                                          +---------+

177                       Figure 1: CE1 multihomed to PE1 and PE2.

179        In Figure 1, when PE2 is inserted in the Ethernet Segment or its

180        CE1-facing interface recovered, PE1 will transfer the DF role of some

181        VLANs to PE2 to achieve load balancing.  However, because there is no

182        handshake mechanism between PE1 and PE2, overlapping of DF roles for

183        a given VLAN is possible which leads to duplication of traffic as

184        well as Layer 2 loops.

186        Current EVPN specifications [RFC7432] and [RFC8584] rely on a timer-

187        based approach for transferring the DF role to the newly inserted

188        device.  This can cause the following issues:

190        *  Loops/Duplicates if the timer value is too short

191        *  Prolonged Traffic Blackholing if the timer value is too long

193     1.4.  Design Principles for a Solution

195        The clock-synchronization solution for fast DF recovery presented in

196        this document follows several design principles and presents

197        multiples advantages, namely:

199        *  Complex handshake signaling mechanisms and state machines are

200           avoided in favor of a simple uni-directional signaling approach.

202        *  The fast DF recovery solution maintains backwards-compatibility

203           (see Section 4) by ensuring that PEs any unrecognized new BGP

204           Extended Community.

206        *  Existing DF Election algorithms remain supported.

208        *  The fast DF recovery solution is independent of any BGP delays in

209           propagation of Ethernet Segment routes (Route Type 4)

minor:

This claim is unclear to me. There is an overall maximum for the propagation

latency plus processing time of "just" a few seconds with the default SCT

calculation, right ? And that is communicated "in conjunction with" the

Ethernet Segment routes according to your below explanation. So there is

a maximum propagation limit. And likely some serialization, timing

dependencies.... ??!!

211        *  The fast DF recovery solution is agnostic of the actual time

212           synchronization mechanism used, and normalizes to NTP for EVPN

213           signalling only.

XXX

215     2.  DF Election Synchronization Solution

217        The fast DF recovery solution relies on the concept of common clock

218        alignment between partner PEs participating in a common Ethernet

219        Segment i.e. PE1 and PE2 in Figure 1.  The main idea is to have all

220        peering PEs of that Ethernet Segment perform DF election, and apply

221        the result at the same pre-announced time.

223        The DF Election procedure, as described in [RFC7432] and as

224        optionally signalled in [RFC8584], is applied.  All PEs attached to a

225        given Ethernet Segment are clock-synchronized using a networking

226        protocol for clock synchronization (e.g., NTP, PTP).  When a new PE

227        is inserted in an Ethernet Segment or a failed PE device of the

228        Ethernet Segment recovers, that PE communicates to peering partners

229        the current time plus the value of the timer for partner discovery

230        from step 2 in Section 8.5 of [RFC7432].  This constitutes an "end

231        time" or "absolute time" as seen from the local PE.  That absolute

232        time is called the "Service Carving Time" (SCT).

234        A new BGP Extended Community, the Service Carving Timestamp is

235        advertised along with the Ethernet Segment route (RT-4) to

236        communicate the Service Carving Time to other partners.

238        Upon receipt of the new BGP Extended Community, partner PEs can

239        determine the service carving time of the newly insterted PE.  To

240        eliminate any potential for duplicate traffic or loops, the concept

241        of skew is introduced: a small time offset to ensure a controlled and

242        orderly transition when multiple Provider Edge (PE) devices are

243        involved.  The receiving partner PEs add a skew (default = -10ms) to

244        the Service Carving Time to enforce this mechanism.  The previously

245        inserted PE(s) must perform service carving first, followed shortly

246        by the newly insterted PE, after the specified skew delay.

248        To summarize, all peering PEs perform service carving almost

249        simultaneously at the time announced by the newly added/recovered PE.

250        The newly inserted PE initiates the SCT, and triggers service carving

251        immediately on its local timer expiry.  The previously inserted PE(s)

252        receiving Ethernet Segment route (RT-4) with a SCT BGP extended

253        community, perform service carving shortly before Service Carving

254        Time.
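
The timing described in this section can be sketched as below, using integer milliseconds. The function names are hypothetical; the 3 second peering timer default and the -10ms skew default are the values mentioned in this thread and in the draft text above.

```python
PEERING_TIMER_MS = 3000  # default partner-discovery timer from the base RFC
SKEW_MS = -10            # default skew applied by the receiving partner PEs


def advertised_sct_ms(now_ms: int, peering_timer_ms: int = PEERING_TIMER_MS) -> int:
    """SCT signalled by the recovering PE: current time plus peering timer."""
    return now_ms + peering_timer_ms


def local_carving_time_ms(sct_ms: int, is_originator: bool, skew_ms: int = SKEW_MS) -> int:
    """Time at which a PE applies the DF election result.

    The recovering PE carves at the SCT itself; partner PEs add the
    (negative) skew, so they give up the DF role slightly earlier.
    """
    return sct_ms if is_originator else sct_ms + skew_ms


sct = advertised_sct_ms(100_000)
recovering = local_carving_time_ms(sct, is_originator=True)
partner = local_carving_time_ms(sct, is_originator=False)
print(f"partner carves at {partner} ms, recovering PE at {recovering} ms")
```

With the defaults the partner carves 10 msec before the recovering PE, which is the "break before make" ordering.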

256     2.1.  BGP Encoding

258        A new BGP extended community is defined to communicate the Service

259        Carving Timestamp for each Ethernet Segment.

261        A new transitive extended community where the Type field is 0x06, and

262        the Sub-Type is 0x0F is advertised along with the Ethernet Segment

263        route.  The expected Service Carving Time is encoded as an 8-octet

264        value as follows:

266                             1                   2                   3

267         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

268        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

269        | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~

270        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

271        ~  Timestamp Seconds            | Timestamp Fractional Seconds  |

272        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

274                            Figure 2: Service Carving Time

276        The timestamp exchanged uses the NTP prime epoch of January 1, 1900

277        [RFC5905] and the 64-bit NTP Timestamp Format.  The NTP Era value is

278        not exchanged and Era 0 is assumed as of the writing of this

279        document.  A DF Election operation occurring exactly at the Era

280        transition boundary some time in 2036 is outside of the scope of this

281        document.

major:

This description effectively only supports the protocol until the end of Era 0,

because it neither describes what to do during the switchover to Era N+1, nor

does it describe how to operate without encoding the Era. This makes

the protocol useful (without another RFC) for less than 12 years. That is IMHO

insufficient.

One simple solution would be to describe that the Era is not included in the

encoding, but that a plausibility check is made on received timestamps. If it

is completely out of range with the receiving router's current Era, but within

range with Era-1 or Era+1, then the timestamp is accordingly adjusted to use that

Era.
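
As an illustration of the plausibility check suggested above, here is a minimal Python sketch. It assumes the receiver keeps its own clock as an absolute count of seconds since the NTP prime epoch (era number times 2^32 plus the 32-bit seconds field), and it picks whichever of Era-1, Era or Era+1 places the received value closest to local time. The name `resolve_era` and the closest-candidate rule are assumptions made for illustration; neither is text from the draft.

```python
ERA_SECONDS = 2 ** 32  # number of seconds covered by one NTP era


def resolve_era(received_seconds: int, local_abs_seconds: int) -> int:
    """Map a received 32-bit NTP seconds value to absolute seconds.

    local_abs_seconds is the receiver's own time as absolute seconds since
    the NTP prime epoch (hypothetical representation: era * 2**32 + seconds).
    """
    local_era = local_abs_seconds // ERA_SECONDS
    # Interpret the received value in the previous, current and next era.
    candidates = [
        (local_era + delta) * ERA_SECONDS + received_seconds
        for delta in (-1, 0, 1)
    ]
    # Keep the interpretation that lies closest to the local clock.
    return min(candidates, key=lambda c: abs(c - local_abs_seconds))


# Receiver is 2 s before the Era 0 -> Era 1 rollover; the sender's SCT
# lies 1 s after the rollover, so its 32-bit seconds field has wrapped to 1.
just_before_rollover = ERA_SECONDS - 2
resolved = resolve_era(1, just_before_rollover)
```

With this rule a timestamp that wraps across the boundary is still interpreted as lying a few seconds ahead of the receiver, rather than roughly 136 years in the past.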

In another solution option, you can encode the Era by carving space from the SCT

encoding as follows:

IMHO, it is unnecessary to encode the fractional seconds with 16 bits.

The accuracy of the signalled timestamp does NOT impact the synchronized

accuracy of the execution of DF switchover. It only impacts the granularity of

timestamps that can be generated. If you would signal only the top 8 bits of

the fractional seconds, then you could still trigger a synchronized switchover

at intervals of 4 msec, which IMHO is more than necessary. And the switchover

could still be synchronized to an arbitrary better accuracy, such as 1 usec if

just the clock synchronization between the routers is that good. Practically

speaking, NTP clock synchronization may often be just 1 msec accurate anyhow.

Even if you consider my thoughts from above concern G.4, and want to assign

different timestamps for every Ethernet Segment (especially with large number

of ethernet segments), then an interval of 4 msec would likely be more than

sufficient granularity.

So with just 8-bit fractional second encoding, you have 8 bits spare in the

encoding you can use for Era and other features (in the future).
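
The granularity figures quoted here can be checked with a few lines of Python. This is only the arithmetic: a fraction field of n bits steps in units of 2^-n seconds. The helper name `fraction_step_seconds` is made up for this note.

```python
def fraction_step_seconds(bits: int) -> float:
    """Smallest time step expressible with `bits` bits of NTP fraction."""
    return 2.0 ** -bits


for bits in (16, 8):
    step_ms = fraction_step_seconds(bits) * 1e3
    print(f"{bits}-bit fraction: step = {step_ms:.5f} ms")
```

A 16-bit fraction steps in units of about 15.3 microseconds and an 8-bit fraction in units of about 3.9 ms, which matches the "15 microseconds" in the draft and the "4 msec" above.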

282        The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds

283        and a 32-bit part for Fraction, which are encoded in the Service

284        Carving Time as follows:

286        *  Timestamp Seconds: 32-bit NTP seconds are encoded in this field.

288        *  Timestamp Fractional Seconds: the high order 16 bits of the NTP

289           'Fraction' field are encoded in this field.

291        When rebuilding a 64-bit NTP Timestamp Format using the values from a

292        received SCT BGP extended community, the lower order 16 bits of the

293        Fractional field are set to 0.  The use of a 16-bit fractional

294        seconds yields adequate precision of 15 microseconds (2^-16 s).
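
To make the encoding concrete, here is a minimal Python sketch of packing and unpacking the 8-octet value according to Figure 2 and the two bullets above. The function names `encode_sct` and `decode_sct` are hypothetical, and the sketch covers only the extended community value, not any BGP attribute framing around it.

```python
import struct
from typing import Tuple

EVPN_TYPE = 0x06    # Type field from Figure 2
SCT_SUBTYPE = 0x0F  # Sub-Type field from Figure 2


def encode_sct(ntp_seconds: int, ntp_fraction: int) -> bytes:
    """Pack a 64-bit NTP timestamp into the 8-octet SCT extended community."""
    frac16 = (ntp_fraction >> 16) & 0xFFFF  # keep the high-order 16 bits only
    return struct.pack("!BBIH", EVPN_TYPE, SCT_SUBTYPE,
                       ntp_seconds & 0xFFFFFFFF, frac16)


def decode_sct(community: bytes) -> Tuple[int, int]:
    """Rebuild (seconds, 32-bit fraction) from a received SCT community."""
    ec_type, sub_type, seconds, frac16 = struct.unpack("!BBIH", community)
    if (ec_type, sub_type) != (EVPN_TYPE, SCT_SUBTYPE):
        raise ValueError("not an SCT extended community")
    # The low-order 16 bits of the fraction are not carried; set them to 0.
    return seconds, frac16 << 16


wire = encode_sct(0x01020304, 0xABCD1234)
seconds, fraction = decode_sct(wire)
```

Round-tripping a timestamp through these two functions loses only the low-order 16 bits of the fraction, which is the truncation described in lines 291-294.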

296        This document introduces a new flag called "T" (for Time

297        Synchronization) to the bitmap field of the DF Election Extended

298        Community defined in [RFC8584].

300                             1                   2                   3

301         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

302        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

303        | Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~

304        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

305        ~     Bitmap    |            Reserved = 0                       |

306        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

308                       Figure 3: DF Election Extended Community

310        *  Bit 3: Time Synchronization (corresponds to Bit 27 of the DF

311           Election Extended Community).  When set to 1, it indicates the

312           desire to use Time Synchronization capability with the rest of the

313           PEs in the Ethernet Segment.

nit:

"Bit 3" is a confusing definition because the "DF Election Extended Community"

field is only mentioned in the prior paragraph and not shown with this name

in the picture.

I would suggest to replace picture 3 with Figure 4 from rfc8584 - which does

show "Bitmap", and then follow it with Figure 5 from rfc8584 with "T" added,

and then follow with the "Bit 3" bullet point.

315        This capability is utilized in conjunction with the agreed-upon DF

316        Election Type.  For instance, if all the PE devices in the Ethernet

317        Segment indicate possessing Time Synchronization capability and

                            ^^^^^^^^^^

nit:

"the desire to use the" (to be consistent with the definition of T in line 312).

318        request the DF Election Type to be Highest Random Weight (HRW), then

319        the HRW algorithm is edused in conjunction with this capability.  A

                                ^^^^^^

nit: deduced ?

320        PE which does not support the procedures set out in this document, or

321        receives a route from another PE in which th capability is not set

                                                       ^

nit: "e"

322        MUST NOT delay Designated Forwarder election as this could lead to

323        duplicate traffic in some instances (overlapping Designated

324        Forwarders).

326     2.2.  Updates to RFC8584

328        This document introduces an additional delay to the events and

329        transitions defined for the default DF election algorithm FSM in

330        Section 2.1 of [RFC8584] without changing the FSM state or event

331        definitions themselves.

333        Upon receiving a RECV_ES message, the peering PE's Finite State

nit:

RFC8584 uses the term "RCVD_ES" for an event, and does not use the term

"RECV_ES" for a message. Unless there is good reason to introduce new

(inconsistent/duplicate) terminology, pls. change to terminology RCVD_ES event.

Also further below (line 350).

334        Machine (FSM) transitions from the DF_DONE (indicating the DF

335        election process was complete) state to the DF_CALC (indicating that

336        a new DF calculation is needed) state.  Due to the Service Carving

337        Time (SCT) included in the Ethernet-Segment update, the completion of

338        the DF_CALC state and the subsequent transition back to the DF_DONE

339        state are delayed.  This delay ensures proper synchronization and

340        prevents conflicts.  Consequently, the accompanying forwarding

341        updates to the Designated Forwarder (DF) and Non-Designated Forwarder

342        (NDF) states are also deferred.

344        The corresponding actions when transitions are performed or states

345        are entered/exited are modified as follows:

nit:

Suggest to rewrite to the following, to be more precise:

Item 9. in RFC8584, Section 2.1, List "Corresponding actions when transitions

are performed or states are entered/exited" is changed as follows:

347        9.  DF_CALC on CALCULATED: Mark the election result for the VLAN or

348            VLAN Bundle.

350            9.1  If an SCT timestamp is present during the RECV_ES event of

351                 Action 11, wait until the time indicated by the SCT before

352                 proceeding to step 9.2.

354            9.2  Assume the role of DF or NDF for the local PE concerning the

355                 VLAN or VLAN Bundle, and transition to the DF_DONE state.

357        This revised approach ensures proper timing and synchronization in

358        the DF election process, avoiding conflicts and ensuring accurate

359        forwarding updates

minor:

a) Given how this is the normative text, i am worried that the "skew" variable

is not mentioned. Please insert accordingly.

b) 9.1 does not seem to cover the SCT delay that needs to be performed (equally,

except for skew) by the newly inserted PE. 9.1 only mentions the condition of

RECV_ES, which to me does not sound like the newly inserted PE.

minor:

I am somewhat irritated that neither RFC8584 nor this draft have any text in the

state machine section to indicate when/how ES routes are generated. This would

help IMHO especially in this new draft, because it is the time when the

timestamp is taken, SCT calculated and inserted into the ES route, and i guess

that that also starts the process leading to CALCULATED event on the newly

inserted router.

361     3.  Synchronization Scenarios

363        Consider Figure 1 as an example, where initially PE2 has failed and

364        PE1 has taken over.  This scenario illustrates the problem with the

365        DF-Election mechanism described in Section 8.5 of [RFC7432],

366        specifically in the context of the timer value configured for all PEs

367        on the Ethernet Segment.

369        Procedure based on Section 8.5 of [RFC7432] with the default 3 second

370        timer in step 2:

372        1.  Initial state: PE1 is in a steady-state and PE2 is recovering

374        2.  Recovery: PE2 recovers at an absolute time of t=99.

376        3.  Advertisement: PE2 advertises RT-4, sent at t=100, to partner

377            PE1.

379        4.  Timer Start: PE2 starts a 3 second timer to allow the reception

380            of RT-4 from other PE nodes.

382        5.  Immediate carving: PE1 performs service carving immediately upon

383            RT-4 reception, i.e.  t=100 plus some BGP propagation delay.

385        6.  Delayed Carving: PE2 performs service carving at time t=103

387        [RFC7432] favors traffic drops over duplicate traffic.  With the

388        above procedure, traffic drops will occur as part of each PE recovery

389        sequence since PE1 transitions some VLANs to Non-Designated Forwarder

390        (NDF) immediately upon RT-4 reception.

391        The timer value (default = 3 seconds) directly affects the duration

392        of the packet drops.  A shorter (or zero) timer may result in

393        duplicate traffic or traffic loops.

395        Procedure based on the Service Carving Time (SCT) approach:

397        1.  Initial state: PE1 is in a steady state, and PE2 is recovering

399        2.  Recovery: PE2 recovers at an absolute time of t=99.

401        3.  Advertisement: PE2 advertises RT-4, sent at t=100, with a target

402            SCT value of t=103 to partner PE1.

404        4.  Timer Start: PE2 starts a 3 second timer to allow the reception

405            of RT-4 from other PE nodes.

minor:

IMHO, this is not a 3 second timer, but a timer with a deadline of t=103. Which

is only at most 3 seconds, depending on whether step 4. happens exactly at

t=100 or somewhat later. Practically, it would always be later. IMHO, it would

be good to emphasize this crucial benefit of the new mechanism. Maybe need

to insert some addtl. processing delay into the section 8.5 example vs. this

example to show this difference (delay between steps 3 and 4).

407        5.  Service Carving Timer: PE1 starts the service carving timer, with

408            the remaining time until t=103

410        6.  Simultaneous Carving: Both PE1 and PE2 carve at an absolute time

411            of t=103

413        To maintain the preference for minimal loss over duplicate traffic,

414        PE1 should carve slightly before PE2 (with skew).  The recovering PE2

415        performs both DF to NDF and NDF to DF transitions per VLAN at the

416        timer's expiry.  The original PE1, which received the SCT, applies

417        the following:

419        *  DF to NDF Transition(s): at t=SCT minus skew, where both PEs are

420           NDF for the skew duration.

422        *  NDF to DF Transition(s): at t=SCT

minor:

In line 238, the draft says "Upon receipt of the new BGP Extended Community" ...

skew is being applied. Above text (line 419) instead defines application of

skew upon determination of the state transition. It may be that in all cases

where the BGP Extended Community is received, there is always only at most a DF

to NDF transition (but no NDF to DF transition, staying at NDF), but it still

is not ideal to have two inconsistent definitions when skew is being applied.

Technically i think the DF to NDF transition case is more sound than the

"receipt of the BGP extended community", aka: fix text around line 238 ?!

424        This split-behavior ensures a smooth DF role transition with minimal

425        loss.
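
The split behavior in lines 419-422 can be sketched in Python as follows. This is an illustration under stated assumptions: the skew is treated as a positive magnitude (0.010 s) that is subtracted, to match "SCT minus skew" (line 243 gives the default as -10ms, to be added); an SCT already in the past is acted on immediately, as the Security Considerations text describes; and the function name `partner_carving_time` and the string role labels are hypothetical.

```python
from typing import Optional

DEFAULT_SKEW_S = 0.010  # magnitude of the default skew, in seconds


def partner_carving_time(sct: float, now: float, old_role: str,
                         new_role: str,
                         skew: float = DEFAULT_SKEW_S) -> Optional[float]:
    """Absolute time at which a partner PE applies a per-VLAN role change.

    Returns None when the election result leaves the role unchanged.
    """
    if old_role == new_role:
        return None
    if (old_role, new_role) == ("DF", "NDF"):
        target = sct - skew  # give up the DF role slightly early
    else:
        target = sct         # take over the DF role exactly at SCT
    # An SCT that has already passed means no timer: act right away.
    return max(target, now)


release_at = partner_carving_time(103.0, 100.2, "DF", "NDF")
takeover_at = partner_carving_time(103.0, 100.2, "NDF", "DF")
```

Tying the skew to the DF to NDF transition, rather than to receipt of the extended community, is the reading used in this sketch.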

427        Using the SCT approach, the negative effect of the timer to allow the

428        reception of RT-4 from other PE nodes is mitigated.  Furthermore, the

429        BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to

430        PE1) becomes a non-issue.  The SCT approach shortens the 3-second

431        timer window to the order of milliseconds.
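
For a rough feel of the difference between the two procedures, the following Python sketch computes the interval during which a VLAN moving from PE1 to PE2 has no DF, using the numbers from the two walk-throughs above. The 50 ms BGP propagation delay is an assumed value chosen only for illustration, and `loss_window` is a hypothetical helper.

```python
def loss_window(old_df_releases_at: float, new_df_takes_over_at: float) -> float:
    """Seconds during which neither PE forwards for a VLAN changing DF."""
    return max(0.0, new_df_takes_over_at - old_df_releases_at)


ADV_TIME = 100.0     # PE2 sends RT-4 at t=100
PEERING_TIMER = 3.0  # default timer from RFC 7432
BGP_DELAY = 0.05     # assumed RT-4 propagation delay (illustrative only)
SKEW = 0.010         # default skew magnitude

# RFC 7432: PE1 carves on RT-4 reception, PE2 carves when its timer expires.
rfc7432_window = loss_window(ADV_TIME + BGP_DELAY, ADV_TIME + PEERING_TIMER)

# SCT: PE1 releases at SCT minus skew, PE2 takes over at SCT.
sct = ADV_TIME + PEERING_TIMER
sct_window = loss_window(sct - SKEW, sct)
```

Under these assumptions the window shrinks from roughly 2.95 seconds to the 10 ms skew, and the BGP delay no longer enters the result as long as the route arrives before the SCT.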

433     3.1.  Concurrent Recoveries

435        In the eventuality 2 or more PEs in a peering Ethernet Segment group

436        are recovering concurrently or roughly the same time, each will

437        advertise a Service Carving Timestamp.  This SCT value would

438        correspond to what each recovering PE considers the "end time" for DF

439        Election.  A similar situation arises in sequentially recovering PEs,

440        when a second PE recovers approximately at the time of the first PE's

441        advertised SCT expiry, and with its own new SCT-2 outside of the

442        initial SCT window.

444        In the case of multiple concurrent DF elections, each initiated by

445        one of the recovering PEs, the SCTs must be ordered chronologically.

446        All PEs shall execute only a single DF Election at the service

447        carving time corresponding to the largest (latest) received timestamp

448        value.  This DF Election will involve all active PEs in a unified DF

449        Election update.

nit:

I think the wording 444-449 is misleading/incomplete. The latest SCT timestamp

is not the top criterion, but if i understand the intent correctly, each "later"

PEi also needs to be considered to be a better(best) DF than the prior PE,

right ? Aka: In your below example (line 451ff),

    PE1 is DF

    When PE1 receives RT-4 from PE2, PE1 will redo DF calculation and

    consider PE2 to be the DF winner

    When PE1 later receives RT-4 from PE3, PE1 will redo DF calculation

    and now consider PE3 to be the DF winner. And only because PE3 is the

    DF winner, will PE1 now also cancel the SCT for PE2.

If on the other hand, the DF HRW for PE3 would be lower than that of PE2,

then PE1 would of course redo the DF election but given how PE3 does not

change the result, this AFAIK should also mean that the SCT from PE3 should have

no impact.

Yes/No ?

In any case it would be useful to improve the description to make this clearer.

Especially if/when i misunderstood it.

451        Example:

453        1.  Initial State: PE1 is in a steady state, with services elected at

454            PE1.

456        2.  Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4

457            with a target SCT value of t=103 to its partners (PE1)

459        3.  Timer Initiation by PE2: PE2 starts a 3 second timer to allow the

460            reception of RT-4 from other PE nodes.

462        4.  Timer Initiation by PE1: PE1 starts the service carving timer,

463            with the remaining time until t=103.

465        5.  Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4

466            with a target SCT value of t=105 to its partners (PE1, PE2).

468        6.  Timer Initiation by PE3: PE3 starts a 3 second timer to allow the

469            reception of RT-4 from other PE nodes

471        7.  Timer Update by PE2: PE2 cancels the running timer and starts the

472            service carving timer with the remaining time until t=105.

474        8.  Timer Update by PE1: PE1 updates its service carving timer, with

475            the remaining time until t=105.

477        9.  Service Carving: PE1, PE2, and PE3 perform service carving at the

478            absolute time of t=105.
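
The timer handling in this example can be sketched in Python as below. The sketch follows the rule as written in lines 444-449 (carve once, at the latest SCT seen) and does not take into account whether the later PE actually changes the election result. The class name `CarvingTimer` and its method are hypothetical.

```python
from typing import Optional


class CarvingTimer:
    """Tracks the single pending service carving deadline on one PE."""

    def __init__(self, own_sct: Optional[float] = None) -> None:
        # A recovering PE starts with the SCT it advertised itself.
        self.deadline = own_sct

    def on_sct(self, sct: float) -> float:
        """Handle an SCT received in an RT-4; keep the latest deadline."""
        if self.deadline is None or sct > self.deadline:
            self.deadline = sct
        return self.deadline


# Replay of steps 1-9: PE2 advertises SCT=103, then PE3 advertises SCT=105.
pe1 = CarvingTimer()
pe2 = CarvingTimer(own_sct=103.0)
pe3 = CarvingTimer(own_sct=105.0)

pe1.on_sct(103.0)     # step 4: PE1 arms its timer for t=103
for pe in (pe1, pe2):
    pe.on_sct(105.0)  # steps 7-8: PE2 and PE1 move to t=105
pe3.on_sct(103.0)     # PE3 also sees PE2's older SCT and ignores it
```

All three PEs end with a single deadline of t=105, which corresponds to step 9.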

480        In the eventuality a PE in a Ethernet Segment group recovers during

481        the discovery window specified in Section 8.5 of [RFC7432], and does

482        not support or advertise the T-bit, then all PEs in the current

483        peering sequence SHALL immediately revert to the default [RFC7432]

484        behavior.

486     4.  Backwards Compatibility

488        For the DF election procedures to achieve global convergence and

489        unanimity within a redundancy group, it is essential that all

490        participating PEs agree on the DF election algorithm to be employed.

491        However, it is possible that some PEs may continue to use the

492        existing modulo-based DF election algorithm from [RFC7432] and not

493        utilize the new Service Carving Time (SCT) BGP extended community.

494        PEs that operate using the baseline DF election mechanism will simply

495        discard the new SCT BGP extended community as unrecognized.

496        [RFC7432] and do not rely on the new SCT BGP extended community.

498        A PE can indicate its willingness to support clock-synchronized

499        carving by signaling the new 'T' DF Election Capability and including

500        the new SCT BGP extended community along with the Ethernet Segment

501        Route (Type-4).  If one or more PEs attached to the Ethernet Segment

502        do not signal T=1, then all PEs in the Ethernet Segment SHALL revert

503        to the timer-based approach as specified in [RFC7432].  This

504        reversion is particularly crucial in preventing VLAN shuffling when

505        more than two PEs are involved.
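
A minimal Python sketch of this all-or-nothing check follows. It assumes the 2-octet Bitmap is numbered with bit 0 as the most significant bit, which is what "Bit 3 ... corresponds to Bit 27 of the DF Election Extended Community" implies for a bitmap starting at bit 24. The helper names are hypothetical.

```python
from typing import Iterable

T_BIT = 3  # Time Synchronization capability, per the registry request


def bitmap_mask(bit: int) -> int:
    """Mask for capability `bit` in the 16-bit Bitmap (bit 0 is the MSB)."""
    return 1 << (15 - bit)


def use_sct_carving(bitmaps: Iterable[int]) -> bool:
    """True only if every PE on the Ethernet Segment signals T=1."""
    bitmaps = list(bitmaps)
    return bool(bitmaps) and all(b & bitmap_mask(T_BIT) for b in bitmaps)


all_capable = use_sct_carving([0x1000, 0x5000])  # both PEs set T
one_legacy = use_sct_carving([0x1000, 0x4000])   # second PE sets only A
```

When the check returns False, every PE on the segment falls back to the timer-based behavior of RFC 7432 rather than mixing the two procedures.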

507     5.  Security Considerations

509        The mechanisms in this document use EVPN control plane as defined in

510        [RFC7432].  Security considerations described in [RFC7432] are

511        equally applicable.

513        For the new SCT Extended Community, attack vectors may be setting the

514        value to zero, to a value in the past or to large times in the

515        future.  The procedures in this document address implicitly what

516        occurs with a carving time in the past, as this would be a naturally

517        occurring event with a large BGP propagation delay: the receiving PE

518        SHALL treat the DF Election at the peer as having occurred already,

519        and proceed without starting any timer to further delay service

520        carving.  For timestamp values in the future, a rogue PE may be

521        advertising a value inconsistent with its local behavior.  This is no

522        different than a rogue PE setting all its DF Election results

523        inconsistently to its peers using (or ignoring adherence to) the

524        procedures from [RFC7432], and the result would similarly be

525        duplicate or dropped traffic.  It is left to implementations to

526        decide what constitutes an "unreasonably large" SCT value.

528        This document uses MPLS and IP-based tunnel technologies to support

529        data plane transport.  Security considerations described in [RFC7432]

530        and in [RFC8365] are equally applicable.

532     6.  IANA Considerations

534        IANA maintains the "EVPN Extended Community Sub-Types" registry set

535        up by [RFC7153].  IANA is requested to confirm the First Come First

536        Served assignment as follows:

538           Sub-Type Value   Name                        Reference       Date

539           --------------   -------------------------   -------------   ----

540                 0x0F       Service Carving Timestamp   This document   TBD

542        IANA should replace the field TBD with the date of publication of this

543        document as an RFC.

545        IANA maintains the "DF Election Capabilities" registry set up by

546        [RFC8584].  IANA is requested to make the following assignment from

547        this registry:

549            Bit         Name                         Reference        Date

550            ----        ----------------             -------------    ----

551            3           Time Synchronization         This document    TBD

553        IANA should replace the field TBD with the date of publication of this

554        document as an RFC.

556     7.  Normative References

558        [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate

559                   Requirement Levels", BCP 14, RFC 2119,

560                   DOI 10.17487/RFC2119, March 1997,

561                   <https://www.rfc-editor.org/info/rfc2119>.

563        [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,

564                   "Network Time Protocol Version 4: Protocol and Algorithms

565                   Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,

566                   <https://www.rfc-editor.org/info/rfc5905>.

568        [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP

569                   Extended Communities", RFC 7153, DOI 10.17487/RFC7153,

570                   March 2014, <https://www.rfc-editor.org/info/rfc7153>.

572        [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,

573                   Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based

574                   Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February

575                   2015, <https://www.rfc-editor.org/info/rfc7432>.

577        [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC

578                   2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,

579                   May 2017, <https://www.rfc-editor.org/info/rfc8174>.

581        [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,

582                   Uttaro, J., and W. Henderickx, "A Network Virtualization

583                   Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,

584                   DOI 10.17487/RFC8365, March 2018,

585                   <https://www.rfc-editor.org/info/rfc8365>.

587        [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,

588                   J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet

589                   VPN Designated Forwarder Election Extensibility",

590                   RFC 8584, DOI 10.17487/RFC8584, April 2019,

591                   <https://www.rfc-editor.org/info/rfc8584>.

593     Appendix A.  Contributors

595        In addition to the authors listed on the front page, the following

596        co-authors have also contributed substantially to this document:

598        Gaurav Badoni

599        Cisco

601        Email: gbadoni@xxxxxxxxx

603        Dhananjaya Rao

604        Cisco

606        Email: dhrao@xxxxxxxxx

608     Appendix B.  Acknowledgements

610        Authors would like to acknowledge helpful comments and contributions

611        of Satya Mohanty and Bharath Vasudevan.  Also thank you to Anoop

612        Ghanwani and Gunter van de Velde for their thorough review with

613        valuable comments and corrections.

615     Authors' Addresses

617        Patrice Brissette (editor)

618        Cisco

619        Email: pbrisset@xxxxxxxxx

621        Ali Sajassi

622        Cisco

623        Email: sajassi@xxxxxxxxx

625        Luc Andre Burdet

626        Cisco

627        Email: lburdet@xxxxxxxxx

629        John Drake

630        Independent

631        Email: je_drake@xxxxxxxxx

633        Jorge Rabadan

634        Nokia

635        Email: jorge.rabadan@xxxxxxxxx

EOF
