Excellent, thanks so much for your help on this.

Yours Irrespectively,

John

> -----Original Message-----
> From: Gyan Mishra <hayabusagsm@xxxxxxxxx>
> Sent: Tuesday, May 18, 2021 4:28 PM
> To: Lars Eggert <lars@xxxxxxxxxx>
> Cc: General Area Review Team <gen-art@xxxxxxxx>; bess@xxxxxxxx; draft-ietf-bess-datacenter-gateway.all@xxxxxxxx; last-call@xxxxxxxx
> Subject: Re: [Last-Call] Genart last call review of draft-ietf-bess-datacenter-gateway-10
>
> Hi Lars & DC Gateway authors,
>
> I will be responding today to the original Gen-ART email I sent, with final comments, and I hope the final comments will help improve the document.
>
> I will also address the comments from John Scudder related to GW failover, as well as Alvaro’s comments related to tunnel encapsulation attribute BGP Prefix-SID sub-TLV limitations. I will also add new text recommendations related to RFC 2119 MUST / SHOULD language to help improve the document.
>
> Thank you
>
> Gyan
>
> On Tue, May 18, 2021 at 3:31 AM Lars Eggert <lars@xxxxxxxxxx> wrote:
>
> > Gyan, thank you for your review and thank you all for the following discussion. I have entered a No Objection ballot for this document based on the current status of the discussion.
> >
> > Lars
> >
> > > On 2021-4-29, at 8:46, Gyan Mishra via Datatracker <noreply@xxxxxxxx> wrote:
> > >
> > > Reviewer: Gyan Mishra
> > > Review result: Not Ready
> > >
> > > I am the assigned Gen-ART reviewer for this draft. The General Area Review Team (Gen-ART) reviews all IETF documents being processed by the IESG for the IETF Chair. Please treat these comments just like any other last call comments.
> > >
> > > For more information, please see the FAQ at
> > >
> > > <https://trac.ietf.org/trac/gen/wiki/GenArtfaq>.
> > >
> > > Document: draft-ietf-bess-datacenter-gateway-??
> > > Reviewer: Gyan Mishra
> > > Review Date: 2021-04-28
> > > IETF LC End Date: 2021-04-29
> > > IESG Telechat date: Not scheduled for a telechat
> > >
> > > Summary:
> > > This document defines a mechanism using the BGP Tunnel Encapsulation attribute to allow each gateway router to advertise the routes to the prefixes in the Segment Routing domains to which it provides access, and also to advertise on behalf of each other gateway to the same Segment Routing domain.
> > >
> > > This draft needs to provide more clarity about the use case, where it would be used, and how it would be implemented. From reading the specification, it appears that some technical gaps exist. There are some major issues with this draft. I don’t think this draft is ready yet.
> > >
> > > Major issues:
> > >
> > > Abstract comments:
> > > The abstract mentions the use of Segment Routing within the Data Center. Is that a requirement for this specification to work, given that it is mentioned throughout the draft? Technically, I would think that discovery of the gateways is feasible without requiring SR within the Data Center.
> > >
> > > Load balancing is a bigger issue: the draft presents it as the problem statement and as what the draft is trying to solve. I will address it in the introduction comments.
> > >
> > > Introduction comments:
> > > In the introduction, the use case is expanded much further, to any functional edge AS, in the text below.
> > >
> > > OLD
> > >
> > > “SR may also be operated in other domains, such as access networks. Those domains also need to be connected across backbone networks through gateways.
> > > For illustrative purposes, consider the Ingress and Egress SR Domains shown in Figure 1 as separate ASes. The various ASes that provide connectivity between the Ingress and Egress Domains could each be constructed differently and use different technologies such as IP, MPLS with global table routing native BGP to the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN”
> > >
> > > This paragraph expands the use case to any ingress or egress stub domain: Data Center, Access, or any other. If that is the case, should the draft name change to something like “stub edge domain services discovery”? As this draft can be used for any stub domain, I would not preclude any use case; I would make the GW discovery open to any service GW edge function and change the draft name to something more appropriate.
> > >
> > > This paragraph also says “for illustrative purposes”, which is fine, but then it expands the overlay/underlay use cases. I believe this mechanism can only be used with a technology that has an overlay/underlay, which would preclude any use case with just underlay global table routing, such as the “IP, MPLS with global table routing native BGP to the edge” that is mentioned. IP or global table routing would be an issue, as this specification requires setting an RT and an export/import RT policy for the discovery of routes advertised by the GWs. As I don’t think this solution, from what I can tell, would work technically for global table routing, I will update the above paragraph to preclude global table routing.
> > > We can add it back in if we can figure that out, but I don’t think any public or private operator would change from a global table carrying all BGP prefixes in the underlay to a VPN overlay, a drastic change that pushes all the any-to-any prefixes into the overlay, as that would be a prerequisite for using this draft.
> > >
> > > From this point forward I am going to assume we are using VPN overlay technology such as SR or MPLS.
> > >
> > > NEW
> > >
> > > “SR may also be operated in other domains, such as access networks. Those domains also need to be connected across backbone networks through gateways. For illustrative purposes, consider the Ingress and Egress SR Domains shown in Figure 1 as separate ASes. The various ASes that provide connectivity between the Ingress and Egress Domains could be two as shown in Figure-1 or could be many more as exist with the public internet use case, and each may be constructed differently and use different technologies such as MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN” with a “BGP Free” Core.
> > >
> > > This may work without a “BGP Free” core, but to reduce design complexity I think we should constrain it to a “BGP Free” core transport layer. SR-TE path steering also gets much more complicated if all P routers are running BGP. I think in this example we can even explicitly say that the example shows the public internet, as that would be one of the primary use cases.
> > >
> > > This paragraph is confusing to the reader.
> > >
> > > As a precursor to this paragraph, I think it may be a good idea to state whether we are talking about global table IP-only routing or VPN overlay technology with SR/MPLS underlay transport. That will make this section much easier to understand.
> > >
> > > In the Figure 1 drawing, you should give an AS number to both the ingress domain and the egress domain, so the reader does not have to guess whether the connection to the egress or ingress domain is iBGP or eBGP, and state eBGP in the text below. Let’s also call out that the intermediate ASNs in the middle of the diagram could be two, as shown illustratively, but could be many operator domains, as when traversing the public internet. In the drawing I would replace ASBR with PE, since I am stating that this solution has to be a VPN overlay paradigm and not global routing. Also, in the VPN overlay scenario, inter-AS peering of any type is almost always between PEs and not a separate dedicated device serving a special “ASBR-ASBR” function, as the PE acts as the border node providing the “ASBR” type function. So in the rewrite I am assuming the drawing has been updated to change ASBR to PE. Let’s give each node a number so that we can be clear in the text exactly which node we are referring to. In the drawing, please show that GW1 peers with PE1, GW2 peers with PE2, and GW3 peers with PE3. GW3 also peers with GW4 and GW2 peers with GW5, where GW4 and GW5 are part of AS3. In the AS1-AS2 peering, the top peering would be PE6 to PE8 and the bottom peering PE7 to PE9, so PE6 and PE7 are in AS1 and PE8 and PE9 are in AS2. I turned the bottom two ASBRs in AS3 into gateways, now called GW4 and GW5, for the selective deterministic load balancing used later in the problem statement.
> > >
> > > One major problem with this problem statement is that it is incorrect in claiming that GW load balancing does not work today in the topology given in Figure-1.
> > > Edge GW load balancing is based on the iBGP path tie-breaker, the lowest common denominator in BGP path selection, which is the lowest IGP underlay metric. As long as the metric is equal and you have iBGP multipath enabled, you can load balance to the egress PE1 and PE2 endpoints. So in this case flows coming from AS1 into AS2 hit an intermediate P router which has iBGP multipath enabled and has, let’s say, equal cost for the route to the next hop attribute. Assuming next-hop-self is set, the cost to loopback0 on PE1 and the cost to loopback0 on PE2 are both, let’s say, 10, so now you have a BGP multipath. What is required, though, is a unique RD: in a “BGP Free” core RR environment, where all PEs peer with the RR as route-reflector clients, the RD must be unique for the RR to reflect all the paths advertised to it to all the egress PE edges. BGP add-paths is only used if you have a Primary and Backup routing setup, where PE1-GW1 has a 0x prepend and PE2-GW2 has a 1x prepend, so that BGP add-paths together with BGP PIC Edge gives you a pre-programmed backup path at the edge. So add-paths is not necessarily something that helps with load balancing; it is in fact orthogonal to load balancing, as it is for Primary/Backup routing and not Active/Active load balancing. Load balancing with a VPN overlay is simply achieved with a unique RD per PE, iBGP multipath, and equal cost paths to the underlay recursive IGP-learned next hop attribute, in this case the PE loopback0, per the next hop rewrite via “next-hop-self” done on the PE-RR peering in a standard VPN overlay topology.
> > > As for load balancing being accomplished in the underlay, what I have stated is independent of SR-TE. However, with an SR-TE candidate path, the load balancing ECMP spray to the egress PE and egress GW AS can also happen with a prefix-SID.
> > >
> > > OLD
> > > Suppose that there are two gateways, GW1 and GW2 as shown in Figure 1, for a given egress SR domain and that they each advertise a route to prefix X which is located within the egress SR domain with each setting itself as next hop. One might think that the GWs for X could be inferred from the routes' next hop fields, but typically it is not the case that both routes get distributed across the backbone: rather only the best route, as selected by BGP, is distributed. This precludes load balancing flows across both GWs.
> > >
> > > I am rewriting the text in the NEW below, as there is some discrepancy about which routes are distributed across the backbone. I am completely rewriting it to make clearer what we are trying to state here, as the text appears to be technically incorrect. To help describe the flow, I will use the BGP route flow to depict the routing and try to get to the problem statement we are trying to portray.
> > >
> > > NEW
> > >
> > > Suppose that there are two gateways, GW1 and GW2 as shown in Figure 1, for a given egress SR domain, and each gateway advertises a VPN prefix X via EBGP to the AS2 core domain with the underlay next hop set to GW1 or GW2. In this case we are doing Active/Active load balancing: PE1 and PE2 receive the VPN prefix and advertise VPN prefix X into the domain with next-hop-self set on the PE-RR peering to the PE’s loopback0.
> > > The P routers within the domain have an ECMP path with an IGP metric tie to egress PE1 and egress PE2 for VPN Prefix X learned from GW1 and GW2. An SR-TE path can now be stitched from GW3 to PE3 SR-TE Segment-1 to PE3 to PE6 and PE7 Segment-2 to PE8 and PE9 to the Egress Domain via PE1 and PE2 to GW1 and GW2. In this case, however, we don’t want the traffic to be steered via SR-TE and load balanced via ingress GW3; we want to take GW3 out of rotation and load balance traffic to GW4 and GW5 instead.
> > >
> > > **The text above provides the updated selective deterministic gateway steering described below to achieve the goal. I think that may have been the intent of the authors and I am just making it clearer.**
> > >
> > > As for the problem statement: since GW load balancing can easily occur in the underlay, as stated, that is not the problem.
> > >
> > > In my mind, the problem statement that we want to describe in both the Abstract and Introduction is not simple vanilla gateway load balancing but rather a predictable, deterministic method of selecting the gateways to be used. That is, each VPN prefix now has a descriptor attached: a tunnel encapsulation attribute which contains multiple TLVs, one or more for each “selected gateway”, where each tunnel TLV contains an egress tunnel endpoint sub-TLV that identifies the gateway for the tunnel. Maybe we can have a priority field in the sub-TLV for the pecking-order preference of which GWs are pushed up into the GW hash selected for the SR-ERO path to be stitched end to end.
> > > So let’s say you had 10 GWs and you break them up into two or more tiers, with maybe gateways 1-5 as primary and 6-10 as backup (which could be due to various reasons), so you can basically pick and choose, based on priority, which GWs get added to the GW hash.
> > >
> > > I have some feedback and comments on the solution and how best to write the verbiage to make it clearer to the reader.
> > >
> > > In the solution, as far as the RT to attach for the GW auto-discovery: with this new RT we are essentially creating a new VPN RIB that has prefixes from all the selected gateways that are discovered from the tunnel encapsulation attribute TLV.
> > >
> > > What is really confusing in the text here is whether the tunnel encapsulation attribute is being attached to the underlay recursive route to the next hop attribute or to the VPN overlay prefix. The reason I am thinking it is being attached to the VPN overlay prefix and not the underlay next hop attribute is this: how would you otherwise create another transport RIB? And if you are creating a new transport RIB, there is already a draft by Kaliraj Vairavakkalai, or BGP-LU SAFI 4 labeled unicast, that exists today to advertise next hops between domains for an end-to-end LSP load-balanced path.
> > >
> > > https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07
> > >
> > > IANA code point below
> > > 76 Classful-Transport SAFI
> > > [draft-kaliraj-idr-bgp-classful-transport-planes-00]
> > >
> > > Also, in line with CT, another option is BGP-LU SAFI 4 to import the loopbacks between domains, which are the next hop attribute to be advertised into the core for an end-to-end LSP.
> > > So the BGP-LU SAFI RIB could be used for the GW next hop advertisement between domains, so that there is visibility of all the egress PE loopback0 addresses between domains. So you can either stitch the LSP as a segmented LSP, like inter-AS option B with SR-TE stitching, and use the next-hop-self PE-RR next-hop rewrite on each of the PEs within the internet domain, or you could import all the PE loopbacks from all ingress and egress domains into the internet domain, similar to inter-AS option C, to create an end-to-end LSP and instantiate an end-to-end SR-TE path.
> > >
> > > Maybe you could attach the RT and the tunnel encapsulation attribute (tunnel TLV, endpoint TLV) to the VPN overlay prefix. I am not sure how that would be beneficial, since the underlay steers the VPN overlay.
> > >
> > > So maybe you could couple the VPN overlay’s new GW RIB RT to the transport underlay CT class RIB or BGP-LU RIB. That coupling may have some benefit, but it would have to be investigated, and I think it is out of scope for the goals of this draft.
> > >
> > > I think we first have to figure out the authors’ goal and purpose for this draft, and how the GW discovery should work in light of the CT class RIB AFI/SAFI codepoint draft that exists today, as well as the BGP-LU option for next hop advertisement within the internet domain.
> > >
> > > Section 3 comments
> > >
> > > “Each GW is configured with an identifier for the SR domain. That identifier is common across all GWs to the domain (i.e., the same identifier is used by all GWs to the same SR domain), and unique across all SR domains that are connected (i.e., across all GWs to all SR domains that are interconnected).
> > >
> > > **No issues with the above**
> > >
> > > A route target ([RFC4360]) is attached to each GW's auto-discovery route and has its value set to the SR domain identifier.
> > >
> > > **So here, if the RT is attached to the GW auto-discovery route, we need to state that this is the underlay route and that the PE does a next-hop-self rewrite of the eBGP link to the BGP egress domain next hop to the loopback0, so that the GW next hop we are tracking across all the ingress and egress PE domains is the egress and ingress PE loopback0.**
> > >
> > > Each GW constructs an import filtering rule to import any route that carries a route target with the same SR domain identifier that the GW itself uses. This means that only these GWs will import those routes, and that all GWs to the same SR domain will import each other's routes and will learn (auto-discover) the current set of active GWs for the SR domain.”
> > >
> > > **So if this is the case, and we are tracking the underlay RIB and attaching a route target to all the ingress PE & P next hops, which are loopback0, then this is literally identical to BGP-LU importing all the loopbacks between domains, or to using CT class. There is no need for this feature to use the tunnel encapsulation attribute. I am not following why you would not use BGP-LU or the CT class RIB.**
> > >
> > > “To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW itself, the GW SHOULD use a different loopback address for packets intended for it.”
> > >
> > > **I don’t understand this statement, as the next hop is the ingress and egress PE loopback0; that is the next hop being tracked for the gateway load balancing.
> > > The GW device subnet between the GW and PE is not advertised into the internet domain, as we do next-hop-self on the PE’s PE-RR iBGP peering, so the GW-to-PE subnet is not advertised.** Looking at it a second time, I think what is intended here is a BGP-LU inter-AS option C style import of loopbacks between domains, so instead of importing loopback0, which carries all packets on the GW device, use a different loopback on GW1 so that it does not carry the FEC of all BAU packets. This is similar to the "per-vrf TE" concept used in RSVP-TE to VPN mapping.

--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call