Hi Toerless,
Thank you for the detailed review. I have updated the inline-comments for -10 which will be posted soon. For the itemized questions please see below. Thanks !
G.1 and G.2 : I will leave that question for a wider scope, this document merely updates existing RFCs -> and the reference to HRW is ‘en passant’ as an improvement which happened over time (perfect
or not...)
G.3 is a very interesting proposal actually, for orderly ‘removal’ from network (maintenance operations). I will give this some more thought with co-authors to see how to incorporate this, thanks
for the valuable suggestion !
G.4 please note this draft currently addresses “controlled recovery” only, not “controlled failures” (as in G.3). While the concern is technically accurate, in reality interface recovery is very rarely in the
“same millisecond” or close thereto.
In practice, even if unlatched all together, recovering interfaces will also have some time gaps in between them. One way to address this concern is to provide for a non-default (configured) skew to account
for hw programming speed(s). More pertinent though, is that this draft allows for larger non-default peering values (the 3s from base RFC), and interfaces that have known-slow programming or a large number of subinterfaces or hosts to program can easily avail
of a larger peering timer specific to the conditions of that ES. The SCT represents the wall-clock of this base-RFC peering timer at the recovering PE.
G.5/G.6 the variant (a) is the one I am aware of implemented by vendors: wait for NTP sync before proceeding to many or most operations in control plane, incl this peering of ethernet-segments.
If NTP sync becomes an issue (on router first-reload for example) delays are usually added prior to inserting the router into the network (advertising routes). In short, NTP sync often becomes a gate to some operations -> I could add some text with a stronger
link to clock-sync before including the SCT extended community ?
G.7 this was always poorly written – I have updated to “subtracting a positive value” throughout – but the “break before make” is actually on purpose. On recovery you do not want 2 interfaces
in DF mode, that will create duplicates, loops etc.
Regards,
Luc André
Luc André Burdet | Cisco | laburdet.ietf@xxxxxxxxx | Tel: +1 613 254 4814
From: Toerless Eckert via Datatracker <noreply@xxxxxxxx>
Date: Wednesday, August 14, 2024 at 12:27
To: iot-directorate@xxxxxxxx <iot-directorate@xxxxxxxx>
Cc: bess@xxxxxxxx <bess@xxxxxxxx>, draft-ietf-bess-evpn-fast-df-recovery.all@xxxxxxxx <draft-ietf-bess-evpn-fast-df-recovery.all@xxxxxxxx>, last-call@xxxxxxxx <last-call@xxxxxxxx>, evyncke@xxxxxxxxx <evyncke@xxxxxxxxx>
Subject: [bess] Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09
Reviewer: Toerless Eckert
Review result: On the Right Track
Summary:
The purpose of the document is to extend the BGP message signaling and local
router procedures for failover of "Designated Forwarders" for pseudowires using
calculated future timestamps and expecting clock synchronization across the
forwarders, so that after receipt of the BGP message, the switchover can be
handled autonomously by every node as synchronously as desired and allowed for
by the clock synchronization method used.
I am the assigned IOTDIR reviewer. I found the document well written and easy
to read, except for some typos, other nits and some logical description gaps.
(unfortunately ?) I find the approach of the draft very useful, and i always
wished we would have been able to build this in other IETF protocol domains (IP
multicast), so i happen to have a range of technical concerns and suggestions
primarily around the completeness of the document's methods and detail
specifications, which i hope will be helpful to improve on the quality of the
text and usefulness of the solution.
The following is a list of G.i general comments followed by the commented
idnits version of the draft.
Thank you very much for the work!
Toerless Eckert
General comments:
G.1 minor: Why IOTdir review ?
I am a bit puzzled why this draft was given to IOTdir for early review. Neither
the draft nor the RFCs it references mentions IoT. And the mentioned pseudowire
use-cases are all around DataCenter. So i wonder what specific IoT feedback the
authors/WG are looking for. If there actually is a specific type of use-case for
IoT with this technology, then it would be great to mention it.
G.2 minor/suggestion: HRW has known problems
HRW was popularized and (in)validated in deployments of PIM-SM since 1995 and
hence rfc2362 way before HRW1998 was written, but of course not credited in
RFC8584. I would nevertheless like to point out that the IP Multicast community
in the IETF had some run-ins with operators over the decades who were
disappointed by its non-equal distribution in the face of specific typical sets of
parameters such as consecutive or close to each other router-IDs. Of course,
the parameters used in EVPN are different, and i have not tried to validate if
or how such deployment specific anomalies would or could equally apply to the
EVPN version, but i would strongly suggest to be aware that HRW is by far not a
well randomizing algorithm, especially for the order of the input parameters.
HRW is now probably 30 years old, and maybe EVPN wants to look into newer,
and supposedly better algorithms such as MurmurHash (which was a recommendation
from a math geek colleague even 15 years ago - and other proposals in the IETF
are picking up on it too).
G.3 minor/question: Please consider adding ordered shutdown support
If my understanding of RFC7432/RFC8584 and this draft is correct, the
interruption in case of an ordered shutdown of a DF is as large as that of an
unexpected shutdown/service interruption (without the detection of interruption
of course). I think this is not necessary.
I think it would be great if this draft could add support for the synchronized
switchover in case of ordered shutdown of a DF because such procedures
constitute likely a large number of outages in daily operations of larger
networks.
For example, the new extended community could have a flag indication of such an
ordered shutdown so that the indicated SCT will trigger synchronized failover
to the BDF (Backup DF). And only after the failover has happened would the
primary DF send out the NLRI withdraw route and finish the shutdown operation.
G.4 major: analysis of actual failover behavior
The mechanism of this draft seems to aspire through synchronized switchover to
achieve a switchover interruption in the order of 10 msec (the skew default
value). I am worried that in the face of a large number of failovers (because
of a large number of VLAN/ES services), that the interruption becomes larger
and that it will be inconsistent across different services.
The way i imagine the failover to operate (from similar failovers in other
technologies like multicast), a router may fairly quickly be able to generate
the SCT carrying routes, so there can be a burst of SCT routes all with the
same SCT. When those SCT then actually expire both on the sending and receiving
router, the speed at which they are added/deleted in hardware-forwarding will
depend on the performance of updating hardware forwarding registers. Which may
be inconsistent across different routers. It is also not clear to me if the BGP
infrastructure or other factors can or can not introduce any reordering. But if
for example we have thousand routes that need to be updated, and one router can
update 1000 routes/sec and the other can update 2000 routes/sec, then one will
be done after half a second, the other after one second - no reordering assumed.
So it would be very helpful to have some idea about the maximum imaginable
scalability required and likely min/max performances to vet the impact of this
candidate issue.
There is of course a way to overcome this issue, which is to generate SCT that
take the performance of (de)installation of hardware forwarding entries into
account, for example by assuming some floor performance and generate SCT for
such burst of service routes with timestamps increasing such that when they
will be executed, they will stay under such a performance floor. For example: have a
difference of e.g. 4 msec between each route, resulting in no more than
250 SCT updates/second.
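
To make the pacing idea concrete, here is a minimal sketch. It is illustrative only: the function names `staggered_scts` and `drain_time` and their parameters are assumptions, not anything defined by the draft or by any implementation. It spaces the SCTs of a burst of routes by a fixed gap derived from an assumed performance floor, and reproduces the 1000 vs. 2000 routes/sec arithmetic from above.

```python
from fractions import Fraction

def staggered_scts(base_sct, route_count, floor_updates_per_sec):
    """Return one SCT (in seconds) per route, spaced so that no more than
    floor_updates_per_sec carvings fall due in any one second.
    Hypothetical helper: the draft does not define such a procedure."""
    gap = Fraction(1, floor_updates_per_sec)  # 1/250 s = 4 msec
    return [base_sct + i * gap for i in range(route_count)]

def drain_time(route_count, updates_per_sec):
    """Seconds a router needs to program route_count entries when all of
    them fall due at the same SCT."""
    return Fraction(route_count, updates_per_sec)

scts = staggered_scts(Fraction(0), 1000, 250)
gap_msec = (scts[1] - scts[0]) * 1000  # spacing between consecutive SCTs
spread = scts[-1] - scts[0]            # time from first to last SCT
```

With a floor of 250 updates/second, 1000 routes are spread over just under 4 seconds, so the pacing trades a longer overall transition for a bounded per-service interruption.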
In any case, it would be great if the great target goal of this draft - less
than 10 msec interruption - would not be invalidated by such real-world
performance impacts, if it actually is easy to overcome them with a bit of
additional text in the draft.
G.5 major: Behavior upon non-synchronization.
I think the draft should do more due-diligence in its text for various
conditions of non-correct time synchronization between devices. Let us first agree
on the conditions and general direction, and then i am happy to propose text if
it makes sense to the WG.
a) A router can and then should validate the state of synchronization of its
clock (in NTP for example this is typically possible via some management API,
not sure if there is already a YANG model). When restarting, the router may find that its clock
is not synchronized to a necessary degree of accuracy yet. Minimum required
synchronization accuracy should be configurable, default maybe 3 msec. In this
case the router would wait until the synchronization is sufficient up to a
maximum time period (configurable, default maybe 30 seconds). If
synchronization is not sufficient then, revert to behave as non-draft compliant
router - and upgrade later on if and when synchronization is successful.
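
Suggestion a) can be sketched as a simple gate. This is a rough illustration under stated assumptions: the function name `sct_mode`, the shape of the input (a sequence of clock-offset readings) and the two mode strings are hypothetical. Only the defaults of 3 msec and 30 seconds come from the text above.

```python
def sct_mode(offset_samples, required_accuracy_msec=3.0, max_wait_sec=30.0):
    """Decide whether to signal an SCT or to behave as a non-draft router.

    offset_samples: iterable of (elapsed_sec, offset_msec) readings taken
    after restart, e.g. polled from the local clock-sync daemon
    (the polling mechanism itself is an assumption).
    """
    for elapsed_sec, offset_msec in offset_samples:
        if elapsed_sec > max_wait_sec:
            break  # waited long enough, give up for now
        if abs(offset_msec) <= required_accuracy_msec:
            return "fast-df-recovery"  # clock is good enough to send an SCT
    return "legacy-timer"  # fall back to plain timer-based behavior
```

A router that returns "legacy-timer" here could re-run the check later and upgrade once synchronization succeeds, as suggested above.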
b) A router which is aware that it is correctly synchronized is receiving an
SCT update from another router which did not correctly recognize its own
synchronization failure (e.g.: does not have the API to validate its local
clock being synchronized). This condition might warrant a flag bit in the
route updates, if feasible.
To discover and work around this condition, routers will perform a plausibility
check on received SCT timestamps, e.g.: validate that the received timestamp is
within a reasonable window around the local (synchronized) clock at the time of
reception of the SCT carrying route: at least one second from current clock, at
most the configured interval (default 3 seconds), plus extensions, such as some
seconds if concern G.4 is taken into account. If the received SCT is out of
bounds, then the receiving router would raise some error condition and perform
some fallback failover, e.g.: within 3 seconds from reception (to avoid that
failover would happen at an inappropriately long time in the future, or
immediately, when SCT is in the past).
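
The plausibility check in b) can be sketched as follows. This is a sketch only: `effective_carve_time` and its parameter names are hypothetical, the window bounds are the example values given above (1 second minimum, 3 seconds maximum), and the fallback of "now + 3 seconds" is one possible reading of "within 3 seconds from reception".

```python
def effective_carve_time(sct, now, min_lead=1.0, max_lead=3.0,
                         fallback_delay=3.0):
    """Return (carve_time, plausible) for a received SCT.

    sct and now are seconds on the receiver's synchronized clock.
    An SCT outside [now + min_lead, now + max_lead] is treated as
    implausible and replaced by a local fallback time.
    """
    lead = sct - now
    if min_lead <= lead <= max_lead:
        return sct, True
    # Out of bounds: too far in the future, too close, or already past.
    return now + fallback_delay, False
```

An implementation would also raise an error condition (log, counter, alarm) when the second element of the result is False.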
G.6 minor: some suggested NTP operational text
The following is proposed text for an NTP clock synchronization operational
considerations section, including only G.5 suggestion a), but also other
aspects crucial for successful deployment.
----
While the use of a synchronized clock between the participating routers makes
the solution itself very simple and accurate, it does introduce a new
potentially large and complex dependency against the clock synchronization
mechanism used. Because of the use of NTP timestamps, it is not possible to
build really lightweight and autonomously operating clock synchronization
systems. Instead, one will likely need to create an operational dependency
against a clock source with automated inclusion of complexities specifically
the leap seconds, which includes satellite clock sources (Beidou, Galileo,
GLONASS or GPS), or terrestrial (DCF77, WWVB, MSF or JJY). If this dependency
is operationally already established for other purposes, then the mechanism of
this document does not provide incremental requirements except maybe for the
required accuracy. Otherwise the requirements to operate the clock
synchronization need to be analyzed.
For the mechanism of this document to provide the desired benefit,
synchronization of a few milliseconds (5) or less is required, so that the skew
is sufficient to separate the break DF times from the make DF times. This
should in general not be a problem to achieve with minimal NTPv4 installations
that are aware of common pitfalls as follows.
When a router restarts, initial synchronization to other NTP server(s) is sped
up if the router has a local battery backed RTC clock from which it can
derive a starting time as well as the capability to step the clock to quickly
synchronize to the other NTP server(s).
If either is not possible, synchronization may take more than a few seconds
after reboot and it may be desirable to delay bringing up DF functionality
until the desired accuracy of clock synchronization is achieved.
Synchronization across WAN links can be subject to asymmetric latency, which
can be as high as some msec, such as for pseudowires across transcontinental
connectivity between backup DCs. Clock synchronization protocols can not
automatically figure out such asymmetric propagation latencies. If deployments
with such asymmetric latencies are required, the clock synchronization protocol
needs to have options to learn about such asymmetries, such as through
configuration.
G.7 minor: make before break instead of break before make
I think that it would make sense to define skew as configurable and explicitly
point to the option of making it positive so as to achieve "make before break"
functionality, E.g.: making the recovering router become DF slightly before the
withdrawing router.
I can think of several type of customer services that can better deal with
duplicates than with even short term losses. And unless i am overlooking some
looping issues in the broadcast domains (which i likely may), the only reason
to do break before make is IMHO services where the simultaneous sending will
result in overload. But whenever a service has a low rate of actual user
traffic, most applications will prefer a few duplicates over a few lost packets.
--
The following is idnits output to have line numbers. issues/discussions from
the review have no line numbers.
------
draft-ietf-bess-evpn-fast-df-recovery-09.txt:
Showing Errors (**), Flaws (~~), Warnings (==), and Comments (--).
Errors MUST be fixed before draft submission. Flaws SHOULD be fixed before
draft submission.
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
No issues found here.
Checking nits according to
https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
No issues found here.
Running in submission checking mode -- *not* checking nits according to
https://www.ietf.org/id-info/checklist .
----------------------------------------------------------------------------
No nits found.
--------------------------------------------------------------------------------
2 BESS Working Group P. Brissette, Ed.
3 Internet-Draft A. Sajassi
4 Updates: 8584 (if approved) LA. Burdet
5 Intended status: Standards Track Cisco
6 Expires: 9 January 2025 J. Drake
7 Independent
8 J. Rabadan
9 Nokia
10 8 July 2024
12 Fast Recovery for EVPN Designated Forwarder Election
13 draft-ietf-bess-evpn-fast-df-recovery-09
15 Abstract
17 The Ethernet Virtual Private Network (EVPN) solution provides
18 Designated Forwarder (DF) election procedures for multihomed Ethernet
19 Segments. These procedures have been enhanced further by applying
20 Highest Random Weight (HRW) algorithm for Designated Forwarder
21 election in order to avoid unnecessary DF status changes upon a
22 failure. This document improves these procedures by providing a fast
23 Designated Forwarder election upon recovery of the failed link or
24 node associated with the multihomed Ethernet Segment. This document
25 updates Section 2.1 of [RFC8584] by optionally introducing delays
26 between some of the events therein.
28 The solution is independent of the number of EVPN Instances (EVIs)
29 associated with that Ethernet Segment and it is performed via a
30 simple signaling between the recovered node and each of the other
31 nodes in the multihoming group.
33 Status of This Memo
35 This Internet-Draft is submitted in full conformance with the
36 provisions of BCP 78 and BCP 79.
38 Internet-Drafts are working documents of the Internet Engineering
39 Task Force (IETF). Note that other groups may also distribute
40 working documents as Internet-Drafts. The list of current Internet-
41 Drafts is at https://datatracker.ietf.org/drafts/current/.
43 Internet-Drafts are draft documents valid for a maximum of six months
44 and may be updated, replaced, or obsoleted by other documents at any
45 time. It is inappropriate to use Internet-Drafts as reference
46 material or to cite them other than as "work in progress."
48 This Internet-Draft will expire on 9 January 2025.
50 Copyright Notice
52 Copyright (c) 2024 IETF Trust and the persons identified as the
53 document authors. All rights reserved.
55 This document is subject to BCP 78 and the IETF Trust's Legal
56 Provisions Relating to IETF Documents (https://trustee.ietf.org/
57 license-info) in effect on the date of publication of this document.
58 Please review these documents carefully, as they describe your rights
59 and restrictions with respect to this document. Code Components
60 extracted from this document must include Revised BSD License text as
61 described in Section 4.e of the Trust Legal Provisions and are
62 provided without warranty as described in the Revised BSD License.
64 Table of Contents
66 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
67 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
68 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3
69 1.3. Challenges with Existing Mechanism . . . . . . . . . . . 3
70 1.4. Design Principles for a Solution . . . . . . . . . . . . 5
71 2. DF Election Synchronization Solution . . . . . . . . . . . . 5
72 2.1. BGP Encoding . . . . . . . . . . . . . . . . . . . . . . 6
73 2.2. Updates to RFC8584 . . . . . . . . . . . . . . . . . . . 7
74 3. Synchronization Scenarios . . . . . . . . . . . . . . . . . . 8
75 3.1. Concurrent Recoveries . . . . . . . . . . . . . . . . . . 10
76 4. Backwards Compatibility . . . . . . . . . . . . . . . . . . . 11
77 5. Security Considerations . . . . . . . . . . . . . . . . . . . 11
78 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12
79 7. Normative References . . . . . . . . . . . . . . . . . . . . 12
80 Appendix A. Contributors . . . . . . . . . . . . . . . . . . . . 13
81 Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 13
82 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 14
84 1. Introduction
86 The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
87 becoming pervasive in data center (DC) applications for Network
88 Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
89 in service provider (SP) applications for next generation virtual
90 private LAN services.
nit: If there is any IoT use, please mention
nit: "pervasive" is a bold statement. I do not know enough to support or doubt
it, but if there was any reference you could add to support the claim, then it
would make it stronger. Else maybe tone it down ("widely used")...
92 [RFC7432] describes Designated Frowarder (DF) election procedures for
^ typo
93 multihomed Ethernet Segments. These procedures are enhanced further
94 in [RFC8584] by applying the Highest Random Weight (HRW) algorithm
nit:
please add the HRW1998 reference as used in RFC8584 as reference for the
term HRW and include it here.
95 for DF election in order to avoid unnecessary DF status changes upon
96 a link or node failure associated with the multihomed Ethernet
97 Segment. This document makes further improvements to the DF election
nit: insert paragraph break before "This" (background -> contribution).
98 procedures in [RFC8584] by providing an option for a fast DF election
99 upon recovery of the failed link or node associated with the
100 multihomed Ethernet Segment. This DF election is achieved
101 independent of the number of EVPN Instances (EVIs) associated with
102 that Ethernet Segment and it is performed via straightforward
103 signaling between the recovered node and each of the other nodes in
104 the multihomed group.
105 This document updates the DF Election Finite State Machine (FSM)
106 described in Section 2.1 of [RFC8584], by optionally introducing
107 delays between some events, as further detailed in Section 2.2. The
108 solution is based on a simple one-way signaling mechanism.
110 1.1. Requirements Language
112 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
113 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
114 "OPTIONAL" in this document are to be interpreted as described in BCP
115 14 [RFC2119] [RFC8174] when, and only when, they appear in all
116 capitals, as shown here.
118 1.2. Terminology
120 PE: Provider Edge device.
122 Designated Forwarder (DF): A PE that is currently forwarding
123 (encapsulating/decapsulating) traffic for a given VLAN in and out
124 of a site.
126 EVI: An EVPN instance spanning the Provider Edge (PE) devices
127 participating in that EVPN.
129 1.3. Challenges with Existing Mechanism
131 In EVPN technology, multiple Provider Edge (PE) devices have the
132 ability to encapsulate and decapsulate data belonging to the same
133 VLAN. Under certain conditions, this may cause Layer2 duplicates and
134 potential loops if there is a momentary overlap in forwarding roles
135 between two or more PE devices, consequently leading to broadcast
136 storms.
138 EVPN [RFC7432] currently specifies timer-based synchronization among
139 PE devices within a redundancy group. This approach can lead to
140 duplications and potential loops due to multiple Designated
141 Forwarders (DFs) if the timer interval is too short, or to packet
142 drops if the timer interval is too long.
144 Split-horizon filtering, as described in Section 8.3 of [RFC7432],
145 can prevent loops but does not address duplicates. However, if there
146 are overlapping Designated Forwarders (DFs) of two different sites
147 simultaneously for the same VLAN, the site identifier will differ
148 when the packet re-enters the Ethernet Segment. Consequently, the
149 split-horizon check will fail, resulting in Layer 2 loops.
minor:
i can not find a description of this setup and problem in [RFC7432],
and the description in the paragraph above is quite terse so that i am not
sure that i could make up a fitting example from scratch. I think it would
thus be useful to provide a topology with an appropriate example of this
condition and explain the problem based on that topology example.
151 The updated Designated Forwarder (DF) procedures outlined in
152 [RFC8584] use the well-known Highest Random Weight (HRW) algorithm to
153 prevent the reshuffling of VLANs among PE devices within the
154 redundancy group during failure or recovery events. This approach
155 minimizes the impact on VLANs not assigned to the failed or recovered
156 ports and eliminates the occurrence of loops or duplicates during
157 such events.
159 However, upon PE insertion or a port being newly added to a
160 multihomed Ethernet Segment, HRW also cannot help as a transfer of DF
161 role to the new port must occur while the old DF is still active.
163 +---------+
164 +-------------+ | |
165 | | | |
166 / | PE1 |----| | +-------------+
167 / | | | MPLS/ | | |---CE3
168 / +-------------+ | VxLAN/ | | PE3 |
169 CE1 - | Cloud | | |
170 \ +-------------+ | |---| |
171 \ | | | | +-------------+
172 \ | PE2 |----| |
173 | | | |
174 +-------------+ | |
175 +---------+
177 Figure 1: CE1 multihomed to PE1 and PE2.
179 In Figure 1, when PE2 is inserted in the Ethernet Segment or its
180 CE1-facing interface recovered, PE1 will transfer the DF role of some
181 VLANs to PE2 to achieve load balancing. However, because there is no
182 handshake mechanism between PE1 and PE2, overlapping of DF roles for
183 a given VLAN is possible which leads to duplication of traffic as
184 well as Layer 2 loops.
186 Current EVPN specifications [RFC7432] and [RFC8584] rely on a timer-
187 based approach for transferring the DF role to the newly inserted
188 device. This can cause the following issues:
190 * Loops/Duplicates if the timer value is too short
191 * Prolonged Traffic Blackholing if the timer value is too long
193 1.4. Design Principles for a Solution
195 The clock-synchronization solution for fast DF recovery presented in
196 this document follows several design principles and presents
197 multiples advantages, namely:
199 * Complex handshake signaling mechanisms and state machines are
200 avoided in favor of a simple uni-directional signaling approach.
202 * The fast DF recovery solution maintains backwards-compatibility
203 (see Section 4) by ensuring that PEs any unrecognized new BGP
204 Extended Community.
206 * Existing DF Election algorithms remain supported.
208 * The fast DF recovery solution is independent of any BGP delays in
209 propagation of Ethernet Segment routes (Route Type 4)
minor:
This claim is unclear to me. There is an overall maximum for the propagation
latency plus processing time of "just" a few seconds with the default SCT
calculation, right ? And that is communicated "in conjunction with" the
Ethernet Segment routes according to your below explanation. So there is
a maximum propagation limit. And likely some serialization, timing
dependencies.... ??!!
211 * The fast DF recovery solution is agnostic of the actual time
212 synchronization mechanism used, and normalizes to NTP for EVPN
213 signalling only.
XXX
215 2. DF Election Synchronization Solution
217 The fast DF recovery solution relies on the concept of common clock
218 alignment between partner PEs participating in a common Ethernet
219 Segment i.e. PE1 and PE2 in Figure 1. The main idea is to have all
220 peering PEs of that Ethernet Segment perform DF election, and apply
221 the result at the same pre-announced time.
223 The DF Election procedure, as described in [RFC7432] and as
224 optionally signalled in [RFC8584], is applied. All PEs attached to a
225 given Ethernet Segment are clock-synchronized using a networking
226 protocol for clock synchronization (e.g., NTP, PTP). When a new PE
227 is inserted in an Ethernet Segment or a failed PE device of the
228 Ethernet Segment recovers, that PE communicates to peering partners
229 the current time plus the value of the timer for partner discovery
230 from step 2 in Section 8.5 of [RFC7432]. This constitutes an "end
231 time" or "absolute time" as seen from the local PE. That absolute
232 time is called the "Service Carving Time" (SCT).
234 A new BGP Extended Community, the Service Carving Timestamp is
235 advertised along with the Ethernet Segment route (RT-4) to
236 communicate the Service Carving Time to other partners.
238 Upon receipt of the new BGP Extended Community, partner PEs can
239 determine the service carving time of the newly insterted PE. To
240 eliminate any potential for duplicate traffic or loops, the concept
241 of skew is introduced: a small time offset to ensure a controlled and
242 orderly transition when multiple Provider Edge (PE) devices are
243 involved. The receiving partner PEs add a skew (default = -10ms) to
244 the Service Carving Time to enforce this mechanism. The previously
245 inserted PE(s) must perform service carving first, followed shortly
246 by the newly insterted PE, after the specified skew delay.
248 To summarize, all peering PEs perform service carving almost
249 simultaneously at the time announced by the newly added/recovered PE.
250 The newly inserted PE initiates the SCT, and triggers service carving
251 immediately on its local timer expiry. The previously inserted PE(s)
252 receiving Ethernet Segment route (RT-4) with a SCT BGP extended
253 community, perform service carving shortly before Service Carving
254 Time.
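
The timing relationship described in this section can be sketched in a few lines. This is illustrative only: the function names are hypothetical and times are plain integer milliseconds rather than NTP timestamps. The 3000 ms value is the default partner-discovery timer of RFC 7432 and -10 ms is the default skew quoted above.

```python
PEERING_TIMER_MS = 3000  # default partner-discovery timer (RFC 7432, Section 8.5)
SKEW_MS = -10            # default skew applied by receiving partner PEs

def advertise_sct_ms(now_ms, peering_timer_ms=PEERING_TIMER_MS):
    """SCT signaled by the newly inserted/recovered PE: its current time
    plus the peering timer. It carves locally exactly at this time."""
    return now_ms + peering_timer_ms

def partner_carve_time_ms(sct_ms, skew_ms=SKEW_MS):
    """Time at which a previously inserted partner PE carves: the received
    SCT plus the (negative) skew, i.e. shortly before the SCT."""
    return sct_ms + skew_ms

sct = advertise_sct_ms(1_000_000)
partner = partner_carve_time_ms(sct)
```

With the defaults, the partner PEs give up the DF role 10 ms before the recovering PE takes it over, which is the "break before make" ordering.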
256 2.1. BGP Encoding
258 A new BGP extended community is defined to communicate the Service
259 Carving Timestamp for each Ethernet Segment.
261 A new transitive extended community where the Type field is 0x06, and
262 the Sub-Type is 0x0F is advertised along with the Ethernet Segment
263 route. The expected Service Carving Time is encoded as an 8-octet
264 value as follows:
266 1 2 3
267 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
268 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
269 | Type = 0x06 | Sub-Type(0x0F)| Timestamp Seconds ~
270 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
271 ~ Timestamp Seconds | Timestamp Fractional Seconds |
272 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
274 Figure 2: Service Carving Time
276 The timestamp exchanged uses the NTP prime epoch of January 1, 1900
277 [RFC5905] and the 64-bit NTP Timestamp Format. The NTP Era value is
278 not exchanged and Era 0 is assumed as of the writing of this
279 document. A DF Election operation occurring exactly at the Era
280 transition boundary some time in 2036 is outside of the scope of this
281 document.
major:
This description effectively only supports the protocol until the end of Era 0,
because it not only fails to describe what to do during switchover to Era N+1, but
it also does not describe how to operate without encoding the Era. This makes
the protocol useful (without another RFC) for less than 12 years. That is IMHO
insufficient.
One simple solution would be to describe that the Era is not included in the
encoding, but that a plausibility check is made on received timestamps. If it
is completely out of range with the receiving router's current Era, but within
range with Era-1 or Era+1, then the timestamp is accordingly adjusted to use that
Era.
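
One way to read this first option is "pick the Era that puts the received timestamp closest to the local clock". The sketch below does exactly that, which is a simplification of the plausibility check described above. `resolve_era` is a hypothetical helper; it assumes the receiver keeps an unbounded count of seconds since the NTP prime epoch.

```python
ERA_SECONDS = 2 ** 32  # length of one NTP Era in seconds

def resolve_era(received_seconds32, local_ntp_seconds):
    """Expand a received 32-bit seconds value to full seconds since 1900.

    Candidates are built for the local Era and its two neighbours; the
    candidate closest to the local clock wins.
    """
    local_era = local_ntp_seconds // ERA_SECONDS
    candidates = [
        (local_era + delta) * ERA_SECONDS + received_seconds32
        for delta in (-1, 0, 1)
    ]
    return min(candidates, key=lambda t: abs(t - local_ntp_seconds))
```

A receiver 2 seconds before the end of Era 0 that gets a seconds value of 1 would thus place it in Era 1 rather than 136 years in the past.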
In another solution option, you can encode the Era by carving space from the SCT
encoding as follows:
IMHO, it is unnecessary to encode the fractional seconds with 16 bits.
The accuracy of the signalled timestamp does NOT impact the synchronized
accuracy of the execution of DF switchover. It only impacts the granularity of
timestamps that can be generated. If you would signal only the top 8 bits of
the fractional seconds, then you could still trigger a synchronized switchover
at intervals of 4 msec, which IMHO is more than necessary. And the switchover
could still be synchronized to an arbitrary better accuracy, such as 1 usec if
just the clock synchronization between the routers is that good. Practically
speaking, NTP clock synchronization may often be just 1 msec accurate anyhow.
Even if you consider my thoughts from concern G.4 above, and want to assign
different timestamps for every Ethernet Segment (especially with large number
of ethernet segments), then an interval of 4 msec would likely be more than
sufficient granularity.
So with just 8 bit fractional second encoding, you have 8 bit spare in the
encoding you can use for Era and other features (in the future).
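
The granularity figures in this comment follow directly from the number of fraction bits carried. A small worked check (the helper name `granularity_seconds` is hypothetical):

```python
from fractions import Fraction

def granularity_seconds(fraction_bits):
    """Smallest timestamp step when only the top fraction_bits bits of the
    NTP fraction are signaled: 2^-fraction_bits seconds."""
    return Fraction(1, 2 ** fraction_bits)

step_8bit_msec = granularity_seconds(8) * 1000       # 3.90625 msec
step_16bit_usec = granularity_seconds(16) * 10 ** 6  # about 15.26 usec
```

So 8 bits give a step of roughly 4 msec, and the 16 bits currently specified give roughly 15 microseconds.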
282 The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds
283 and a 32-bit part for Fraction, which are encoded in the Service
284 Carving Time as follows:
286 * Timestamp Seconds: 32-bit NTP seconds are encoded in this field.
288 * Timestamp Fractional Seconds: the high order 16 bits of the NTP
289 'Fraction' field are encoded in this field.
291 When rebuilding a 64-bit NTP Timestamp Format using the values from a
292 received SCT BGP extended community, the lower order 16 bits of the
293 Fractional field are set to 0. The use of a 16-bit fractional
294 seconds yields adequate precision of 15 microseconds (2^-16 s).
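
The 8-octet layout of Figure 2 can be packed and unpacked as sketched below. The Type (0x06) and Sub-Type (0x0F) values and the field widths are taken from the draft text above. The function names are hypothetical and this is not taken from any implementation.

```python
import struct

SCT_TYPE = 0x06
SCT_SUBTYPE = 0x0F
_SCT_FORMAT = "!BBIH"  # network order: type, sub-type, 32-bit seconds, 16-bit fraction

def encode_sct(ntp_seconds, ntp_fraction):
    """Build the 8-octet extended community from a 64-bit NTP timestamp
    given as (32-bit seconds, 32-bit fraction). Only the high-order
    16 bits of the fraction are carried."""
    return struct.pack(_SCT_FORMAT, SCT_TYPE, SCT_SUBTYPE,
                       ntp_seconds & 0xFFFFFFFF,
                       (ntp_fraction >> 16) & 0xFFFF)

def decode_sct(data):
    """Return (seconds, fraction) with the low 16 fraction bits set to 0."""
    ec_type, sub_type, seconds, frac16 = struct.unpack(_SCT_FORMAT, data)
    if (ec_type, sub_type) != (SCT_TYPE, SCT_SUBTYPE):
        raise ValueError("not a Service Carving Timestamp extended community")
    return seconds, frac16 << 16
```

Note that the round trip is lossy by design: the low-order 16 bits of the fraction are dropped on encode and come back as zero.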
296 This document introduces a new flag called "T" (for Time
297 Synchronization) to the bitmap field of the DF Election Extended
298 Community defined in [RFC8584].
300 1 2 3
301 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
302 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
303 | Type = 0x06 | Sub-Type(0x06)| RSV | DF Alg | |A| |T| ~
304 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
305 ~ Bitmap | Reserved = 0 |
306 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
308 Figure 3: DF Election Extended Community
310 * Bit 3: Time Synchronization (corresponds to Bit 27 of the DF
311 Election Extended Community). When set to 1, it indicates the
312 desire to use Time Synchronization capability with the rest of the
313 PEs in the Ethernet Segment.
nit:
"Bit 3" is a confusing definition because the "DF Election Extended Community"
field is only mentioned in the prior paragraph and not shown with this name
in the picture.
I would suggest replacing Figure 3 with Figure 4 from RFC8584, which does
show "Bitmap", then following it with Figure 5 from RFC8584 with "T" added,
and then following with the "Bit 3" bullet point.
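The relationship between "Bit 3" of the Bitmap and "Bit 27" of the extended community can be sketched as below. This assumes a 2-octet Bitmap that starts right after the Type, Sub-Type and RSV/DF Alg octets, with bit 0 as the most significant bit; the constant and function names are hypothetical.

```python
BITMAP_WIDTH = 16    # assumed: 2-octet Bitmap field
BITMAP_OFFSET = 24   # Type (8 bits) + Sub-Type (8 bits) + RSV/DF Alg (8 bits)
T_BIT_NUMBER = 3     # Time Synchronization, as proposed in the draft


def bitmap_mask(bit_number: int) -> int:
    """Mask for a Bitmap bit, counting bit 0 as the most significant bit."""
    return 1 << (BITMAP_WIDTH - 1 - bit_number)


def community_bit(bit_number: int) -> int:
    """Position of a Bitmap bit within the whole extended community."""
    return BITMAP_OFFSET + bit_number


def wants_time_sync(bitmap: int) -> bool:
    """True when the T bit is set in a received 16-bit Bitmap value."""
    return bool(bitmap & bitmap_mask(T_BIT_NUMBER))


print(community_bit(T_BIT_NUMBER), hex(bitmap_mask(T_BIT_NUMBER)))
```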
315 This capability is utilized in conjunction with the agreed-upon DF
316 Election Type. For instance, if all the PE devices in the Ethernet
317 Segment indicate possessing Time Synchronization capability and
^^^^^^^^^^
nit:
"the desire to use the" (to be consistent with the definition of T in line 312).
318 request the DF Election Type to be Highest Random Weight (HRW), then
319 the HRW algorithm is edused in conjunction with this capability. A
^^^^^^
nit: deduced ?
320 PE which does not support the procedures set out in this document, or
321 receives a route from another PE in which th capability is not set
^
nit: "e"
322 MUST NOT delay Designated Forwarder election as this could lead to
323 duplicate traffic in some instances (overlapping Designated
324 Forwarders).
326 2.2. Updates to RFC8584
328 This document introduces an additional delay to the events and
329 transitions defined for the default DF election algorithm FSM in
330 Section 2.1 of [RFC8584] without changing the FSM state or event
331 definitions themselves.
333 Upon receiving a RECV_ES message, the peering PE's Finite State
nit:
RFC8584 uses the term "RCVD_ES" for an event, and does not use the term
"RECV_ES" for a message. Unless there is a good reason to introduce new
(inconsistent/duplicate) terminology, please change this to the RFC8584 term,
the RCVD_ES event. The same applies further below (line 350).
334 Machine (FSM) transitions from the DF_DONE (indicating the DF
335 election process was complete) state to the DF_CALC (indicating that
336 a new DF calculation is needed) state . Due to the Service Carving
337 Time (SCT) included in the Ethernet-Segment update, the completion of
338 the DF_CALC state and the subsequent transition back to the DF_DONE
339 state are delayed. This delay ensures proper synchronization and
340 prevents conflicts. Consequently, the accompanying forwarding
341 updates to the Designated Forwarder (DF) and Non-Designated Forwarder
342 (NDF) states are also deferred.
344 The corresponding actions when transitions are performed or states
345 are entered/exited are modified as follows:
nit:
Suggest to rewrite to the following, to be more precise:
Item 9. in RFC8584, Section 2.1, List "Corresponding actions when transitions
are performed or states are entered/exited" is changed as follows:
347 9. DF_CALC on CALCULATED: Mark the election result for the VLAN or
348 VLAN Bundle.
350 9.1 If an SCT timestamp is present during the RECV_ES event of
351 Action 11, wait until the time indicated by the SCT before
352 proceeding to step 9.2.
354 9.2 Assume the role of DF or NDF for the local PE concerning the
355 VLAN or VLAN Bundle, and transition to the DF_DONE state.
357 This revised approach ensures proper timing and synchronization in
358 the DF election process, avoiding conflicts and ensuring accurate
359 forwarding updates
minor:
a) Given that this is the normative text, I am worried that the "skew" variable
is not mentioned. Please insert it accordingly.
b) 9.1 does not seem to cover the SCT delay that needs to be performed (equally,
except for skew) by the newly inserted PE. 9.1 only mentions the condition of
RECV_ES, which to me does not sound like the newly inserted PE.
minor:
I am somewhat irritated that neither RFC8584 nor this draft has any text in the
state machine section to indicate when/how ES routes are generated. This would
help IMHO, especially in this new draft, because that is the moment when the
timestamp is taken and the SCT is calculated and inserted into the ES route, and
I guess it also starts the process leading to the CALCULATED event on the newly
inserted router.
361 3. Synchronization Scenarios
363 Consider Figure 1 as an example, where initially PE2 has failed and
364 PE1 has taken over. This scenario illustrates the problem with the
365 DF-Election mechanism described in Section 8.5 of [RFC7432],
366 specifically in the context of the timer value configured for all PEs
367 on the Ethernet Segment.
369 Procedure based on Section 8.5 of [RFC7432] with the default 3 second
370 timer in step 2:
372 1. Initial state: PE1 is in a steady-state and PE2 is recovering
374 2. Recovery: PE2 recovers at an absolute time of t=99.
376 3. Advertisement: PE2 advertises RT-4, sent at t=100, to partner
377 PE1.
379 4. Timer Start: PE2 starts a 3 second timer to allow the reception
380 of RT-4 from other PE nodes.
382 5. Immediate carving: PE1 performs service carving immediately upon
383 RT-4 reception, i.e. t=100 plus some BGP propagation delay.
385 6. Delayed Carving: PE2 performs service carving at time t=103
387 [RFC7432] favors traffic drops over duplicate traffic. With the
388 above procedure, traffic drops will occur as part of each PE recovery
389 sequence since PE1 transitions some VLANs to Non-Designated Forwarder
390 (NDF) immediately upon RT-4 reception.
391 The timer value (default = 3 seconds) directly affects the duration
392 of the packet drops. A shorter (or zero) timer may result in
393 duplicate traffic or traffic loops.
395 Procedure based on the Service Carving Time (SCT) approach:
397 1. Initial state: PE1 is in a steady state, and PE2 is recovering
399 2. Recovery: PE2 recovers at an absolute time of t=99.
401 3. Advertisement: PE2 advertises RT-4, sent at t=100, with a target
402 SCT value of t=103 to partner PE1.
404 4. Timer Start: PE2 starts a 3 second timer to allow the reception
405 of RT-4 from other PE nodes.
minor:
IMHO, this is not a 3 second timer, but a timer with a deadline of t=103, which
is at most 3 seconds, depending on whether step 4 happens exactly at
t=100 or somewhat later. Practically, it would always be later. IMHO, it would
be good to emphasize this crucial benefit of the new mechanism. Maybe you need
to insert some additional processing delay (between steps 3 and 4) into the
Section 8.5 example and this example to show the difference.
407 5. Service Carving Timer: PE1 starts the service carving timer, with
408 the remaining time until t=103
410 6. Simultaneous Carving: Both PE1 and PE2 carve at an absolute time
411 of t=103
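The difference raised in the comment on step 4, a fixed-duration timer versus a timer armed against the SCT deadline, can be sketched as follows. Times are in milliseconds on the example's timeline; the processing delays between steps 3 and 4 and all names are hypothetical.

```python
def fixed_timer_expiry_ms(timer_start_ms: int, duration_ms: int = 3000) -> int:
    """Timer armed for a fixed duration: expiry drifts with the start time."""
    return timer_start_ms + duration_ms


def deadline_timer_expiry_ms(sct_ms: int) -> int:
    """Timer armed against the advertised SCT: expiry is independent of start."""
    return sct_ms


RT4_SENT_MS = 100_000   # t=100 from the example
SCT_MS = 103_000        # t=103 from the example

# Hypothetical processing delays between sending RT-4 and arming the timer.
for processing_delay_ms in (0, 200, 800):
    start = RT4_SENT_MS + processing_delay_ms
    print(processing_delay_ms,
          fixed_timer_expiry_ms(start),                 # moves later with delay
          deadline_timer_expiry_ms(SCT_MS) - start)     # remaining time shrinks
```

The fixed timer expires later as the processing delay grows, whereas the deadline timer always expires at t=103 and only the remaining wait shrinks.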
413 To maintain the preference for minimal loss over duplicate traffic,
414 PE1 should carve slightly before PE2 (with skew). The recovering PE2
415 performs both DF to NDF and NDF to DF transitions per VLAN at the
416 timer's expiry. The original PE1, which received the SCT, applies
417 the following:
419 * DF to NDF Transition(s): at t=SCT minus skew, where both PEs are
420 NDF for the skew duration.
422 * NDF to DF Transition(s): at t=SCT
minor:
In line 238, the draft says "Upon receipt of the new BGP Extended Community" ...
skew is being applied. The text above (line 419) instead defines the application
of skew upon determination of the state transition. It may be that in all cases
where the BGP Extended Community is received, there is at most a DF to NDF
transition (and no NDF to DF transition), or the PE stays at NDF, but it still
is not ideal to have two inconsistent definitions of when skew is applied.
Technically I think the DF to NDF transition case is more sound than the
"receipt of the BGP extended community" one, i.e.: fix the text around line 238?!
424 This split-behavior ensures a smooth DF role transition with minimal
425 loss.
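The split behavior in lines 413-422 can be sketched as a small schedule calculation. Times are in milliseconds, the skew is treated as a positive value that is subtracted, and the 10 msec skew used in the example is purely an assumption, as are the function and key names.

```python
from typing import Dict


def transition_times_ms(sct_ms: int, skew_ms: int, recovering: bool) -> Dict[str, int]:
    """Return when each kind of per-VLAN transition is executed.

    recovering=True  -> the PE that advertised the SCT (PE2 in the example)
    recovering=False -> a PE that received the SCT (PE1 in the example)
    """
    if skew_ms < 0:
        raise ValueError("skew is modelled as a positive value to subtract")
    if recovering:
        # The recovering PE performs both transitions at timer expiry.
        return {"DF->NDF": sct_ms, "NDF->DF": sct_ms}
    # The receiving PE gives up DF slightly early, so both PEs are NDF
    # for the skew duration (brief loss rather than duplicate traffic).
    return {"DF->NDF": sct_ms - skew_ms, "NDF->DF": sct_ms}


pe1 = transition_times_ms(sct_ms=103_000, skew_ms=10, recovering=False)
pe2 = transition_times_ms(sct_ms=103_000, skew_ms=10, recovering=True)
print(pe1, pe2)
```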
427 Using the SCT approach, the negative effect of the timer to allow the
428 reception of RT-4 from other PE nodes is mitigated. Furthermore, the
429 BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
430 PE1) becomes a non-issue. The SCT approach shortens the 3-second
431 timer window to the order of milliseconds.
433 3.1. Concurrent Recoveries
435 In the eventuality 2 or more PEs in a peering Ethernet Segment group
436 are recovering concurrently or roughly the same time, each will
437 advertise a Service Carving Timestamp. This SCT value would
438 correspond to what each recovering PE considers the "end time" for DF
439 Election. A similar situation arises in sequentially recovering PEs,
440 when a second PE recovers approximately at the time of the first PE's
441 advertised SCT expiry, and with its own new SCT-2 outside of the
442 initial SCT window.
444 In the case of multiple concurrent DF elections, each initiated by
445 one of the recovering PEs, the SCTs must be ordered chronologically.
446 All PEs shall execute only a single DF Election at the service
447 carving time corresponding to the largest (latest) received timestamp
448 value. This DF Election will involve all active PEs in a unified DF
449 Election update.
nit:
I think the wording in 444-449 is misleading/incomplete. The latest SCT timestamp
is not the top criterion; if I understand the intent correctly, each "later"
PEi also needs to be considered a better (best) DF than the prior PE,
right? I.e., in your example below (line 451ff),
PE1 is DF
When PE1 receives RT-4 from PE2, PE1 will redo DF calculation and
consider PE2 to be the DF winner
When PE1 later receives RT-4 from PE3, PE1 will redo DF calculation
and now consider PE3 to be the DF winner. And only because PE3 is the
DF winner, will PE1 now also cancel the SCT for PE2.
If on the other hand, the DF HRW for PE3 would be lower than that of PE2,
then PE1 would of course redo the DF election but given how PE3 does not
show the result, this AFAIK should also mean that the SCT from PE3 should have
no impact.
Yes/No ?
In any case it would be useful to improve the description to make this clearer.
Especially if/when i misunderstood it.
451 Example:
453 1. Initial State: PE1 is in a steady state, with services elected at
454 PE1.
456 2. Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4
457 with a target SCT value of t=103 to its partners (PE1)
459 3. Timer Initiation by PE2: PE2 starts a 3 second timer to allow the
460 reception of RT-4 from other PE nodes.
462 4. Timer Initiation by PE1: PE1 starts the service carving timer,
463 with the remaining time until t=103.
465 5. Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4
466 with a target SCT value of t=105 to its partners (PE1, PE2).
468 6. Timer Initiation by PE3: PE3 starts a 3 second timer to allow the
469 reception of RT-4 from other PE nodes
471 7. Timer Update by PE2: PE2 cancels the running timer and starts the
472 service carving timer with the remaining time until t=105.
474 8. Timer Update by PE1: PE1 updates its service carving timer, with
475 the remaining time until t=105.
477 9. Service Carving: PE1, PE2, and PE3 perform service carving at the
478 absolute time of t=105.
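The example above follows the rule in lines 444-449 as written: every PE carves once, at the latest SCT it knows of. A minimal Python sketch of that rule is shown below; it deliberately ignores the question of who wins the election, and the names are hypothetical.

```python
from typing import Dict, Iterable


def effective_sct(known_scts: Iterable[int]) -> int:
    """Single carving time per the draft text: the largest (latest) SCT."""
    return max(known_scts)


# SCTs (in seconds on the example's timeline) known to each PE once both
# RT-4 advertisements have propagated.
known: Dict[str, list] = {
    "PE1": [103, 105],   # received from PE2 and PE3
    "PE2": [103, 105],   # its own SCT plus the one received from PE3
    "PE3": [105, 103],   # its own SCT plus the one received from PE2
}

carving_time = {pe: effective_sct(scts) for pe, scts in known.items()}
print(carving_time)
```

All three PEs arrive at t=105, matching step 9.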
480 In the eventuality a PE in a Ethernet Segment group recovers during
481 the discovery window specified in Section 8.5 of [RFC7432], and does
482 not support or advertise the T-bit, then all PEs in the current
483 peering sequence SHALL immediately revert to the default [RFC7432]
484 behavior.
486 4. Backwards Compatibility
488 For the DF election procedures to achieve global convergence and
489 unanimity within a redundancy group, it is essential that all
490 participating PEs agree on the DF election algorithm to be employed.
491 However, it is possible that some PEs may continue to use the
492 existing modulo-based DF election algorithm from [RFC7432] and not
493 utilize the new Service Carving Time (SCT) BGP extended community.
494 PEs that operate using the baseline DF election mechanism will simply
495 discard the new SCT BGP extended community as unrecognized.
496 [RFC7432] and do not rely on the new SCT BGP extended community.
498 A PE can indicate its willingness to support clock-synchronized
499 carving by signaling the new 'T' DF Election Capability and including
500 the new SCT BGP extended community along with the Ethernet Segment
501 Route (Type-4). If one or more PEs attached to the Ethernet Segment
502 do not signal T=1, then all PEs in the Ethernet Segment SHALL revert
503 to the timer-based approach as specified in [RFC7432]. This
504 reversion is particularly crucial in preventing VLAN shuffling when
505 more than two PEs are involved.
507 5. Security Considerations
509 The mechanisms in this document use EVPN control plane as defined in
510 [RFC7432]. Security considerations described in [RFC7432] are
511 equally applicable.
513 For the new SCT Extended Community, attack vectors may be setting the
514 value to zero, to a value in the past or to large times in the
515 future. The procedures in this document address implicitly what
516 occurs with a carving time in the past, as this would be a naturally
517 occurring event with a large BGP propagation delay: the receiving PE
518 SHALL treat the DF Election at the peer as having occurred already,
519 and proceed without starting any timer to futher delay service
520 carving. For timestamp values in the future, a rogue PE may be
521 advertising a value inconsistent with its local behavior. This is no
522 different than a rogue PE setting all its DF Election results
523 inconstently to its peers using (or ignoring adherence to) the
524 procedures from [RFC7432], and the result would similarly be
525 duplicate or dropped traffic. It is left to implementations to
526 decide what consists an "unreasonably large" SCT value.
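One possible receiver-side handling of the cases in lines 513-526 is sketched below. An SCT of zero or in the past yields no delay, as the draft describes. An SCT beyond a locally chosen bound is only flagged, since the draft leaves both the bound and the reaction to implementations. The function name, the `max_wait_ms` parameter and the use of `None` as the flag are assumptions.

```python
from typing import Optional


def carving_delay_ms(sct_ms: int, now_ms: int, max_wait_ms: int) -> Optional[int]:
    """Delay before service carving, or None if the SCT looks unreasonable.

    max_wait_ms is a hypothetical local policy bound, not a value from the draft.
    """
    if sct_ms <= now_ms:
        # Zero or past SCT: treat the peer's election as already done.
        return 0
    wait = sct_ms - now_ms
    if wait > max_wait_ms:
        # "Unreasonably large": flag it and let local policy decide.
        return None
    return wait


print(carving_delay_ms(0, 100_000, 10_000),
      carving_delay_ms(103_000, 100_500, 10_000),
      carving_delay_ms(900_000, 100_000, 10_000))
```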
528 This document uses MPLS and IP-based tunnel technologies to support
529 data plane transport. Security considerations described in [RFC7432]
530 and in [RFC8365] are equally applicable.
532 6. IANA Considerations
534 IANA maintains the "EVPN Extended Community Sub-Types" registry set
535 up by [RFC7153]. IANA is requested to confirm the First Come First
536 Served assignment as follows:
538 Sub-Type Value Name Reference Date
539 -------------- ------------------------- ------------- ----
540 0x0F Service Carving Timestamp This document TBD
542 IANA should replace the field TBD with the date of publicaton of this
543 document as an RFC.
545 IANA maintains the "DF Election Capabilities" registry set up by
546 [RFC8584]. IANA is requested to make the following assignment from
547 this registry:
549 Bit Name Reference Date
550 ---- ---------------- ------------- ----
551 3 Time Synchronization This document TBD
553 IANA should replace the field TBD with the date of publicaton of this
554 document as an RFC.
556 7. Normative References
558 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
559 Requirement Levels", BCP 14, RFC 2119,
560 DOI 10.17487/RFC2119, March 1997,
561 <https://www.rfc-editor.org/info/rfc2119>.
563 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
564 "Network Time Protocol Version 4: Protocol and Algorithms
565 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
566 <https://www.rfc-editor.org/info/rfc5905>.
568 [RFC7153] Rosen, E. and Y. Rekhter, "IANA Registries for BGP
569 Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
570 March 2014, <https://www.rfc-editor.org/info/rfc7153>.
572 [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
573 Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
574 Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
575 2015, <https://www.rfc-editor.org/info/rfc7432>.
577 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
578 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
579 May 2017, <https://www.rfc-editor.org/info/rfc8174>.
581 [RFC8365] Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
582 Uttaro, J., and W. Henderickx, "A Network Virtualization
583 Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
584 DOI 10.17487/RFC8365, March 2018,
585 <https://www.rfc-editor.org/info/rfc8365>.
587 [RFC8584] Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
588 J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
589 VPN Designated Forwarder Election Extensibility",
590 RFC 8584, DOI 10.17487/RFC8584, April 2019,
591 <https://www.rfc-editor.org/info/rfc8584>.
593 Appendix A. Contributors
595 In addition to the authors listed on the front page, the following
596 co-authors have also contributed substantially to this document:
598 Gaurav Badoni
599 Cisco
601 Email: gbadoni@xxxxxxxxx
603 Dhananjaya Rao
604 Cisco
606 Email: dhrao@xxxxxxxxx
608 Appendix B. Acknowledgements
610 Authors would like to acknowledge helpful comments and contributions
611 of Satya Mohanty and Bharath Vasudevan. Also thank you to Anoop
612 Ghanwani and Gunter van de Velde for their thorough review with
613 valuable comments and corrections.
615 Authors' Addresses
617 Patrice Brissette (editor)
618 Cisco
619 Email: pbrisset@xxxxxxxxx
621 Ali Sajassi
622 Cisco
623 Email: sajassi@xxxxxxxxx
625 Luc Andre Burdet
626 Cisco
627 Email: lburdet@xxxxxxxxx
629 John Drake
630 Independent
631 Email: je_drake@xxxxxxxxx
633 Jorge Rabadan
634 Nokia
635 Email: jorge.rabadan@xxxxxxxxx
EOF