Hi Lars & DC Gateway authors,
I will reply today to my original Gen-ART review email with final comments, which I hope will help improve the document.
I will also address John Scudder’s comments on GW failover and Alvaro’s comments on the limitations of the Tunnel Encapsulation attribute BGP Prefix-SID sub-TLV. I will also propose new text on RFC 2119 MUST / SHOULD language to help improve the document.
Thank you
Gyan
On Tue, May 18, 2021 at 3:31 AM Lars Eggert <lars@xxxxxxxxxx> wrote:
Gyan, thank you for your review and thank you all for the following discussion. I have entered a No Objection ballot for this document based on the current status of the discussion.
Lars
> On 2021-4-29, at 8:46, Gyan Mishra via Datatracker <noreply@xxxxxxxx> wrote:
>
> Reviewer: Gyan Mishra
> Review result: Not Ready
>
> I am the assigned Gen-ART reviewer for this draft. The General Area
> Review Team (Gen-ART) reviews all IETF documents being processed
> by the IESG for the IETF Chair. Please treat these comments just
> like any other last call comments.
>
> For more information, please see the FAQ at
>
> <https://trac.ietf.org/trac/gen/wiki/GenArtfaq>.
>
> Document: draft-ietf-bess-datacenter-gateway-??
> Reviewer: Gyan Mishra
> Review Date: 2021-04-28
> IETF LC End Date: 2021-04-29
> IESG Telechat date: Not scheduled for a telechat
>
> Summary:
> This document defines a mechanism using the BGP Tunnel Encapsulation
> attribute to allow each gateway router to advertise the routes to the
> prefixes in the Segment Routing domains to which it provides access,
> and also to advertise on behalf of each other gateway to the same
> Segment Routing domain.
>
> This draft needs to provide more clarity on the use case, on where this would
> be used, and on how it would be used and implemented. From reading the
> specification it appears there are some technical gaps. There are some major
> issues with this draft. I don’t think this draft is ready yet.
>
> Major issues:
>
> Abstract comments:
> The abstract mentions the use of Segment Routing within the Data Center. Is
> that a requirement for this specification to work, as it is mentioned
> throughout the draft? Technically I would think the concept of gateway
> discovery is feasible without requiring SR within the Data Center.
>
> The concept of load balancing is a bigger issue: the draft presents it as the
> problem statement and as what the draft is trying to solve. I address this
> in the introduction comments.
>
> Introduction comments:
> In the introduction the use case is expanded much further, to any functional
> edge AS, per the text below.
>
> OLD
>
> “SR may also be operated in other domains, such as access networks.
> Those domains also need to be connected across backbone networks
> through gateways. For illustrative purposes, consider the Ingress
> and Egress SR Domains shown in Figure 1 as separate ASes. The
> various ASes that provide connectivity between the Ingress and Egress
> Domains could each be constructed differently and use different
> technologies such as IP, MPLS with global table routing native BGP to
> the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN”
>
> This paragraph expands the use case to any ingress or egress stub domain: Data
> Center, Access or any other. If that is the case, should the draft name change
> to something like “stub edge domain services discovery”? As this draft can be
> used for any domain, I would not preclude any use case; I would make the GW
> discovery open to any service GW edge function and change the draft name to
> something more appropriate.
>
> This paragraph also says “for illustrative purposes”, which is fine, but then
> it expands the overlay/underlay use cases. I believe this solution can only be
> used with a technology that has an overlay/underlay, which would preclude any
> use case with just an underlay doing global table routing, such as the
> “IP, MPLS with global table routing native BGP to the edge” case mentioned.
> IP or global table routing would be an issue because this specification
> requires setting an RT and an export/import RT policy for the discovery of
> routes advertised by the GWs. As I don’t think this solution would work
> technically for global table routing, I will update the above paragraph to
> preclude global table routing. We can add it back if we can figure that out,
> but I don’t think any public or private operator would make the drastic change
> from a global table carrying all BGP prefixes in the underlay to a VPN overlay,
> pushing all the any-any prefixes into the overlay, which would be a
> prerequisite for using this draft.
>
> From this point forward I am going to assume we are using VPN overlay
> technology such as SR or MPLS.
>
> NEW
>
> “SR may also be operated in other domains, such as access networks.
> Those domains also need to be connected across backbone networks
> through gateways. For illustrative purposes, consider the Ingress
> and Egress SR Domains shown in Figure 1 as separate ASes. The
> various ASes that provide connectivity between the Ingress and Egress
> Domains could be two as shown in Figure 1, or could be many more as in the
> public internet use case, and each may be constructed differently
> and use different technologies such as MPLS IP VPN, SR-MPLS IP VPN, or SRv6
> IP VPN with a “BGP Free” core.”
>
> This may work without a “BGP Free” core, but to reduce design complexity I
> think we should constrain this to a “BGP Free” core transport layer. SR-TE path
> steering also gets much more complicated if all P routers are running BGP.
> I think we can even say explicitly that this example shows the public
> internet, as that would be one of the primary use cases.
>
> This paragraph is confusing to the reader.
>
> As a precursor to this paragraph, I think it may be a good idea to state that
> we are talking about global table IP-only routing or VPN overlay technology
> with SR/MPLS underlay transport. That will make this section much easier to
> understand.
>
> In the Figure 1 drawing you should give an AS number to both the ingress
> domain and the egress domain, so the reader does not have to guess whether the
> connection to the egress or ingress domain is iBGP or eBGP, and state eBGP in
> the text below. Let’s also say that the intermediate ASes in the middle of the
> diagram could be two, as shown illustratively, but could be many operator
> domains, as when traversing the public internet. In the drawing I would
> replace ASBR with PE, since I am stating that this solution has to use a VPN
> overlay paradigm and not global routing. Also, in the VPN overlay scenario,
> inter-AS peering of any type is almost always between PEs and not between
> separate dedicated devices serving a special “ASBR-ASBR” function, as the PE
> acts as the border node providing the “ASBR” type function. So in the rewrite
> I am assuming the drawing has been updated, changing ASBR to PE. Let’s give
> each node a number so that we can be clear in the text exactly which node we
> are referring to. In the drawing please show that GW1 peers with PE1, GW2
> peers with PE2, and GW3 peers with PE3. GW3 also peers with GW4, and GW2 peers
> with GW5, where GW4 and GW5 are part of AS3. In the AS1-AS2 peering, the top
> peering would be PE6 to PE8 and the bottom peering PE7 to PE9. So PE6 and PE7
> are in AS1, and PE8 and PE9 are in AS2. I made the bottom two ASBRs in AS3 the
> gateways for selective deterministic load balancing, now calling them GW4 and
> GW5; they are used later in the problem statement.
>
> One major problem with this problem statement is that it is incorrect in
> claiming that GW load balancing does not work today in the topology given in
> Figure 1. Edge GW load balancing is based on the iBGP path tie-breaker, the
> lowest common denominator in BGP path selection, which is the lowest IGP
> underlay metric: as long as the metric is equal and iBGP multipath is enabled,
> you can load balance to the egress PE1 and PE2 endpoints. So in this case flows
> coming from AS1 into AS2 hit an intermediate P router that has iBGP multipath
> enabled and has, let’s say, equal cost to the next hop attribute. Assuming
> next-hop-self is set, the cost to loopback0 on PE1 and the cost to loopback0
> on PE2 are both, let’s say, 10, so now you have a BGP multipath.
>
> What is required, though, is a unique RD per PE. In a “BGP Free” core RR
> environment, all PEs are route-reflector clients that peer with the RR, and
> for all the paths advertised to the RR to be reflected to all the egress PE
> edges, the RD must be unique. BGP add-paths is only needed if you have
> Primary / Backup routing set up, where PE1-GW1 has a 0x prepend and PE2-GW2
> has a 1x prepend, so that add-paths together with BGP PIC Edge gives you a
> pre-programmed edge backup path. So add-paths is not necessarily something
> that helps load balancing; it is in fact orthogonal to load balancing, as it is
> for Primary / Backup routing and not Active / Active load-balanced routing.
> Load balancing with a VPN overlay is simply achieved with a unique RD per PE,
> iBGP multipath, and equal-cost paths to the recursive, IGP-learned underlay
> next hop attribute, in this case the PE loopback0 per the next hop rewrite via
> “next-hop-self” done on the PE-RR peering in a standard VPN overlay topology.
> What I have stated about load balancing in the underlay is independent of
> SR-TE; however, with an SR-TE candidate path the ECMP load-balancing spray
> toward the egress PEs and egress GW AS can also happen with a prefix-SID.
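
To make the multipath argument above concrete, here is a minimal sketch in Python. It is not taken from the draft: the `Path` record, the RD strings, the next-hop names and the metric values are all illustrative assumptions, and best-path selection is reduced to the IGP-metric tie-breaker only. It models a route reflector without add-paths that keeps one best path per (RD, prefix), followed by a node with multipath enabled that keeps every next hop tied on the lowest IGP metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    rd: str          # route distinguisher configured on the advertising PE
    prefix: str      # VPN prefix (without the RD)
    next_hop: str    # PE loopback0 after next-hop-self
    igp_metric: int  # IGP cost to next_hop (one shared value, for simplicity)


def reflect(paths):
    """Model an RR without add-paths: keep one best path per (RD, prefix)."""
    best = {}
    for p in paths:
        key = (p.rd, p.prefix)
        if key not in best or p.igp_metric < best[key].igp_metric:
            best[key] = p
    return list(best.values())


def multipath_next_hops(paths, prefix):
    """Return every next hop tied on the lowest IGP metric for the prefix."""
    candidates = [p for p in paths if p.prefix == prefix]
    if not candidates:
        return []
    lowest = min(p.igp_metric for p in candidates)
    return sorted(p.next_hop for p in candidates if p.igp_metric == lowest)


# Hypothetical values: PE1 and PE2 both advertise prefix X at IGP cost 10.
unique_rd = [
    Path("65002:1", "X", "PE1-lo0", 10),
    Path("65002:2", "X", "PE2-lo0", 10),
]
shared_rd = [
    Path("65002:1", "X", "PE1-lo0", 10),
    Path("65002:1", "X", "PE2-lo0", 10),
]

print(multipath_next_hops(reflect(unique_rd), "X"))  # both PEs survive reflection
print(multipath_next_hops(reflect(shared_rd), "X"))  # only one path is reflected
```

With unique RDs both PE loopbacks remain available for multipath; with a shared RD the reflector collapses the two paths into one before multipath can apply.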
>
> OLD
> Suppose that there are two gateways, GW1 and GW2 as shown in
> Figure 1, for a given egress SR domain and that they each advertise a
> route to prefix X which is located within the egress SR domain with
> each setting itself as next hop. One might think that the GWs for X
> could be inferred from the routes' next hop fields, but typically it
> is not the case that both routes get distributed across the backbone:
> rather only the best route, as selected by BGP, is distributed. This
> precludes load balancing flows across both GWs.
>
> I am rewriting this text in the NEW version because there is a discrepancy
> between the routes described as distributed across the backbone and what
> actually gets distributed. I am completely rewriting it to make clearer what
> we are trying to state, as the text appears to be technically incorrect. I
> will use the BGP route flow to depict the routing and to get to the problem
> statement we are trying to portray.
>
> NEW
>
> Suppose that there are two gateways, GW1 and GW2 as shown in
> Figure 1, for a given egress SR domain and each gateway advertises
> a VPN prefix X to the AS2 core domain via EBGP with the underlay next hop set
> to GW1 or GW2. In this case we are Active / Active load balancing: PE1 and PE2
> receive the VPN prefix and advertise VPN prefix X into the domain with
> next-hop-self set on the PE-RR peering to the PE’s loopback0. The P routers
> within the domain have ECMP paths with an IGP metric tie to egress PE1 and
> egress PE2 for VPN prefix X learned from GW1 and GW2. An SR-TE path can now be
> stitched from GW3 to PE3 SR-TE Segment-1 to PE3 to PE6 and PE7 Segment-2 to
> PE8 and PE9 to Egress Domain via PE1 and PE2 to GW1 and GW2. In this case,
> however, we don’t want the traffic steered via SR-TE to be load balanced via
> ingress GW3; we want to take GW3 out of rotation and load balance traffic to
> GW4 and GW5 instead.
>
> **The text above provides the updated selective deterministic gateway steering
> described below to achieve the goal. I think that may have been the intent of
> the authors and I am just making it clearer.**
>
> As for the problem statement: GW load balancing can easily occur in the
> underlay, as stated, so that is not the problem.
>
> In my mind, the problem statement we want to describe in both
> the Abstract and Introduction is not simple vanilla gateway load balancing but
> rather a predictable, deterministic method of selecting the gateways to be
> used. That is, each VPN prefix now has a descriptor attached, the tunnel
> encapsulation attribute, which contains multiple TLVs, one or more for each
> “selected gateway”, and each tunnel TLV contains an egress tunnel endpoint
> sub-TLV that identifies the gateway for the tunnel. Maybe the sub-TLV could
> carry a priority field giving a pecking order for which GWs are pushed into
> the GW hash selected for the SR-ERO path to be stitched end to end. So let’s
> say you had 10 GWs and you break them up into two or more tiers, with maybe
> gateways 1-5 as primary and 6-10 as backup, which could be due to various
> reasons. You can then pick and choose, based on priority, which GWs get added
> to the GW hash.
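
The tiering idea above could be sketched as follows in Python. This is only an illustration of the suggestion, not something the draft defines: the `priority` field, the "lower value wins" convention and the `active` flag are hypothetical, and the sketch says nothing about how such a field would be encoded in a sub-TLV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gateway:
    name: str
    priority: int        # hypothetical field: lower value = preferred tier
    active: bool = True  # whether the gateway is currently usable


def select_gateways(gateways):
    """Return the names of the active gateways in the best available tier."""
    active = [gw for gw in gateways if gw.active]
    if not active:
        return []
    best_tier = min(gw.priority for gw in active)
    return [gw.name for gw in active if gw.priority == best_tier]


# Ten gateways: GW1-GW5 form the primary tier, GW6-GW10 the backup tier.
gateways = [Gateway(f"GW{i}", priority=0 if i <= 5 else 1) for i in range(1, 11)]
primary = select_gateways(gateways)

# If every primary gateway fails, the backup tier is selected instead.
degraded = [Gateway(gw.name, gw.priority, active=gw.priority != 0) for gw in gateways]
backup = select_gateways(degraded)

print(primary)
print(backup)
```

Only the selected names would go into the GW hash; a backup-tier gateway is used only once no primary-tier gateway is active.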
>
> I have some feedback and comments on the solution and how best to write the
> verbiage to make it more clear to the reader.
>
> On the solution, as far as the RT to attach for the GW auto-discovery goes:
> with this new RT we are essentially creating a new VPN RIB that has prefixes
> from all the selected gateways that are discovered from the tunnel
> encapsulation attribute TLV.
>
> What is really confusing in the text here is whether the tunnel encapsulation
> attribute is attached to the underlay recursive route to the next hop
> attribute or to the VPN overlay prefix. The reason I think it is attached to
> the VPN overlay prefix and not the underlay next hop attribute is: how would
> you otherwise create another transport RIB? And if you are creating a new
> transport RIB, there is already a draft by Kaliraj Vairavakkalai, or
> BGP-LU SAFI 4 labeled unicast that exists today, to advertise next hops
> between domains for an end-to-end load-balanced LSP path.
>
> https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07
>
> IANA code point below
> 76 Classful-Transport SAFI
> [draft-kaliraj-idr-bgp-classful-transport-planes-00]
>
> Also, in line with CT, another option is BGP-LU SAFI 4 to import the loopbacks
> between domains, the loopback being the next hop attribute to be advertised
> into the core for the end-to-end LSP. So the BGP-LU SAFI RIB could be used for
> the GW next hop advertisement between domains, so that all the egress PE
> loopback0 addresses are visible between domains. So you can either stitch a
> segmented LSP, like inter-AS option B with SR-TE stitching, and use the
> next-hop-self PE-RR next hop rewrite on each of the PEs within the internet
> domain, or you can import all the PE loopbacks from all ingress and egress
> domains into the internet domain, similar to inter-AS option C, to create an
> end-to-end LSP and instantiate an end-to-end SR-TE path.
>
> Maybe you could attach the RT and the tunnel encapsulation attribute (tunnel
> TLV with endpoint sub-TLV) to the VPN overlay prefix. I am not sure how that
> would be beneficial, as the underlay steers the VPN overlay.
>
> So maybe you could couple the new VPN overlay GW RIB RT to the transport
> underlay CT class RIB or BGP-LU RIB. The coupling may have some benefit, but
> that would have to be investigated, and I think it is out of scope of the
> goals of this draft.
>
> I think we first have to figure out the authors’ goal and purpose for this
> draft and how the GW discovery should work in light of the CT draft, with its
> CT RIB AFI/SAFI codepoint, that exists today, as well as the BGP-LU option for
> next hop advertisement within the internet domain.
>
> Section 3 comments
>
> “Each GW is configured with an identifier for the SR domain. That
> identifier is common across all GWs to the domain (i.e., the same
> identifier is used by all GWs to the same SR domain), and unique
> across all SR domains that are connected (i.e., across all GWs to
> all SR domains that are interconnected).
>
> **No issues with the above**
>
> A route target ([RFC4360]) is attached to each GW's auto-discovery
> route and has its value set to the SR domain identifier.
>
> **So if the RT is attached to the GW auto-discovery route, we need to state
> whether that is the underlay route, and that the PE does a next-hop-self
> rewrite of the eBGP link next hop toward the egress domain to its loopback0,
> so that the GW next hop we are tracking for all the ingress and egress PE
> domains is the egress and ingress PE loopback0.**
>
> Each GW constructs an import filtering rule to import any route
> that carries a route target with the same SR domain identifier
> that the GW itself uses. This means that only these GWs will
> import those routes, and that all GWs to the same SR domain will
> import each other's routes and will learn (auto-discover) the
> current set of active GWs for the SR domain.”
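
The import rule quoted above can be sketched in Python as follows. This is a rough model under stated assumptions, not the draft's encoding: `AutoDiscoveryRoute`, the `originator` field and the domain identifier strings are hypothetical, and route targets are modeled as plain strings rather than extended communities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoDiscoveryRoute:
    originator: str           # GW that originated the auto-discovery route
    route_targets: frozenset  # route targets attached to the route


def discover_gateways(local_domain_id, received_routes):
    """Import only routes whose route target matches the local SR domain
    identifier and return the set of gateways learned from them."""
    return {
        route.originator
        for route in received_routes
        if local_domain_id in route.route_targets
    }


# Hypothetical routes: GW1 and GW2 serve domain-A, GW3 serves domain-B.
routes = [
    AutoDiscoveryRoute("GW1", frozenset({"domain-A"})),
    AutoDiscoveryRoute("GW2", frozenset({"domain-A"})),
    AutoDiscoveryRoute("GW3", frozenset({"domain-B"})),
]

print(sorted(discover_gateways("domain-A", routes)))
```

A GW configured with `domain-A` imports only the GW1 and GW2 routes, which is how it learns the current set of active GWs for its own SR domain.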
>
> **So if this is the case, and we are tracking the underlay RIB and attaching a
> route target to all the ingress PE & P next hops, which is loopback0, this is
> literally identical to BGP-LU importing all the loopbacks between domains or
> using CT class. There is no need for this feature to use the tunnel
> encapsulation attribute. I am not following why you would not use BGP-LU or
> the CT class RIB.**
>
> “To avoid the side effect of applying the Tunnel Encapsulation
> attribute to any packet that is addressed to the GW itself, the GW
> SHOULD use a different loopback address for packets intended for it.”
>
> **I don’t understand this statement, as the next hop being tracked for the
> gateway load balancing is the ingress and egress PE loopback0.
> The GW device subnet between the GW and PE is not advertised into the internet
> domain, because we do next-hop-self on the PE-RR iBGP peering.**
> Looking at it a second time, I think the intent here is a BGP-LU inter-AS
> option C style import of loopbacks between domains. So instead of importing
> the loopback0 that carries all packets on the GW device, the GW would use a
> different loopback on GW1 so that it does not carry the FEC of all BAU packets,
> a concept similar to the "per-vrf TE" concept used in RSVP-TE to VPN mapping.
>
> “As described in Section 1, each GW will include a Tunnel
> Encapsulation attribute with the GW encapsulation information for
> each of the SR domain's active GWs (including itself) in every route
> advertised externally to that SR domain. As the current set of
> active GWs changes (due to the addition of a new GW or the failure/
> removal of an existing GW) each externally advertised route will be
> re-advertised with a new Tunnel Encapsulation attribute which
> reflects current set of active GWs.”
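
The re-advertisement behavior quoted above might look like the following Python sketch. It is a simplified model and an assumption on my part, not the wire format: the attribute is represented as a list of dictionaries, one per Tunnel TLV, and `GatewayAdvertiser`, `update_active_set` and the `sent` log are hypothetical names.

```python
def build_tunnel_encap(active_gateways):
    """Build one Tunnel TLV per active GW, each naming that GW as the
    tunnel egress endpoint (modeled as a dict, not the real encoding)."""
    return [{"egress_endpoint": gw} for gw in sorted(active_gateways)]


class GatewayAdvertiser:
    def __init__(self, prefixes):
        self.prefixes = list(prefixes)  # routes advertised outside the SR domain
        self.active = set()             # current set of active GWs
        self.sent = []                  # log of (prefix, attribute) advertisements

    def update_active_set(self, active_gateways):
        """Re-advertise every external route when the active GW set changes."""
        new_set = set(active_gateways)
        if new_set == self.active:
            return False                # nothing changed, nothing is re-sent
        self.active = new_set
        attribute = build_tunnel_encap(self.active)
        for prefix in self.prefixes:
            self.sent.append((prefix, attribute))
        return True


adv = GatewayAdvertiser(["X", "Y"])
adv.update_active_set({"GW1", "GW2"})  # initial advertisement of X and Y
adv.update_active_set({"GW2", "GW1"})  # same set, so no re-advertisement
adv.update_active_set({"GW1"})         # GW2 failed: X and Y are re-advertised
print(len(adv.sent))
print(adv.sent[-1])
```

The point of the model is the cost: every change in the active GW set triggers one re-advertisement per externally advertised route.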
>
> **What is the route being advertised externally from the GW? The routes
> advertised would be all the PE loopbacks, advertised from both the ingress
> and egress domains into the internet domain, and all loopbacks from the
> internet domain into the ingress and egress domains, which could be done via
> BGP-LU or the CT RIB. There is no need to reinvent the wheel and create a new
> RIB. BGP-LU or the CT RIB tracks the current set of active next hop GW
> loopbacks between domains. If you do SR-TE stitching, then you can do
> next-hop-self on each PE’s PE-RR peering for the load balancing, and that
> would work, with the load balancing being to the PE loopbacks. Or, if it is an
> end-to-end SR-TE path using BGP-LU or the CT RIB and importing all the PE
> loopbacks between domains, the current set of active GWs would be tracked via
> the BGP-LU or CT RIB. So if the active GWs change due to GW failures, they
> would be withdrawn from the BGP-LU or CT underlay RIB. There is no need for
> the tunnel encapsulation attribute, at least for the GW auto-discovery load
> balancing.**
>
> I think it may still be possible to retrofit this draft to use the CT RIB or
> BGP-LU for the GW load balancing, so that nothing new has to be designed as
> far as the underlay goes. However, there may be some benefit in giving the
> underlay some visibility into the VPN overlay route by using the tunnel
> encapsulation attribute and RT import policy attached to the VPN overlay
> prefixes.
>
> As the CT draft provides a complete solution for underpinning the VPN overlay,
> per VPN or per prefix, to the underlay CT RIB, the problem statement is
> completely solved with either the CT draft or BGP-LU.
>
> Minor issues:
> None
>
> Nits/editorial comments:
>
> Please add normative and informative references below.
>
> I would reference as normative, or maybe informative, the CT class draft,
> which creates a new transport class. I think this draft can work well
> in conjunction with the CT class by coupling the GW RIB it creates to the CT
> class transport RIB and providing the end-to-end inter-AS stitching via the
> PCE CC controller. I am one of the co-authors of the CT draft, and I think it
> could be coupled with this GW draft to achieve the overall goal of selective
> GW load balancing.
>
> https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07
>
> I would also reference this draft for CT class PCEP coloring extension.
>
> https://tools.ietf.org/html/draft-rajagopalan-pcep-rsvp-color-00
>
> As this solution would use a centralized PCE CC controller for inter-AS
> path instantiation for the GW load balancing, I think it would be a good idea
> to reference the PCE CC, H-PCE, Inter-AS PCE and PCE SR extension documents as
> informative and maybe even normative references.
>
>
>
> --
> last-call mailing list
> last-call@xxxxxxxx
> https://www.ietf.org/mailman/listinfo/last-call