Dear Authors,
Attached is a text version ("-gsm" update) of version 10 that contains a first cut at what I think would be appropriate RFC 2119 SHOULD / MUST language for a specification. I also made some editorial updates to make
the specification clearer to the reader. In this thread and on our call we talked about changing "ingress & egress domain" to "ingress & egress site", and I have made that change as well. The word "domain" makes it unclear
which domain is being referred to, so the change to "site" really helps readability.
Below are a few more questions and thoughts for the authors to help finalize the draft for publication:
GW failover: (John Scudder)
GWs will need a local iBGP session for failover. For the scenario where one GW is disconnected from the backbone, the draft clearly states that the GW's advertisement is withdrawn. When the active set of GWs changes,
each externally advertised route is re-advertised with a new Tunnel Encapsulation attribute, the union that reflects the current set of active GWs.
Consider inconsistent routing within the site: GW1 can reach GW2, but GW1 cannot reach S2. This is low probability but entirely possible. Maybe the draft should include a note that in this scenario the mechanism may make things worse, with a blackhole to GW2.
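To make the re-advertisement behavior concrete, here is a minimal Python sketch of recomputing the union when a GW is withdrawn. It is only an illustration under my own assumptions: TunnelTLV, attribute_union, and on_gw_withdrawn are hypothetical names, the "SR" tunnel type is just a label, and nothing here reflects the wire encoding or any implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TunnelTLV:
    """Hypothetical stand-in for one Tunnel TLV; not a wire encoding."""
    tunnel_type: str      # illustrative label only
    egress_endpoint: str  # address from the Tunnel Egress Endpoint sub-TLV


def attribute_union(active_gws: dict[str, list[TunnelTLV]]) -> list[TunnelTLV]:
    """Union of the Tunnel TLVs of every currently active GW, in stable order."""
    seen = set()
    union = []
    for gw in sorted(active_gws):
        for tlv in active_gws[gw]:
            if tlv not in seen:
                seen.add(tlv)
                union.append(tlv)
    return union


def on_gw_withdrawn(active_gws, gw):
    """Drop a GW whose auto-discovery route was withdrawn; return the new union."""
    remaining = {name: tlvs for name, tlvs in active_gws.items() if name != gw}
    return remaining, attribute_union(remaining)


# Assumed example addresses from the documentation range 192.0.2.0/24.
gws = {
    "GW1": [TunnelTLV("SR", "192.0.2.1")],
    "GW2": [TunnelTLV("SR", "192.0.2.2")],
}
before = attribute_union(gws)
gws, after = on_gw_withdrawn(gws, "GW2")
print([t.egress_endpoint for t in before])
print([t.egress_endpoint for t in after])
```

Every externally advertised route would then carry the recomputed union, which is why convergence time after a change in the active set matters.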
Section 5 - RFC 9012 Tunnel Encapsulation attribute BGP Prefix-SID limitations (Alvaro Retana)
SR end to end, or at a minimum within an SR domain, may not be a general use case and may be limited because the BGP Prefix-SID sub-TLV can only be used for IPv4/IPv6 labeled unicast (AFI/SAFI 1/4 and 2/4).
We may want to comment in Section 5 that use of SR may be limited and is not a general use case. Also, does this limitation impact the use of SRv6?
Additional thoughts for the authors.
Does the draft require SR in the backbone or can RSVP-TE be used?
If RSVP-TE can be used, maybe a different name for the identifier should be used and not SR domain identifier.
Section 3 - Is the SR domain identifier value the RT that is attached to the GW auto discovery route?
RFC 4360 is mentioned in Section 3 as a normative reference; however, RFC 5668 (4-octet AS specific extended community) should also be mentioned as normative.
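For reference, here is a small Python sketch of the two route target encodings in question: the two-octet AS specific form from RFC 4360 and the four-octet AS specific form from RFC 5668. The function names and example values are my own, and whether the SR domain identifier maps onto either form is exactly the open question for the authors.

```python
import struct

RT_SUBTYPE = 0x02  # Route Target sub-type


def rt_two_octet_as(asn: int, local_admin: int) -> bytes:
    """RFC 4360 Two-Octet AS Specific extended community (type 0x00)."""
    if not 0 <= asn <= 0xFFFF:
        raise ValueError("ASN does not fit in two octets")
    # type, sub-type, 2-octet global administrator, 4-octet local administrator
    return struct.pack("!BBHI", 0x00, RT_SUBTYPE, asn, local_admin)


def rt_four_octet_as(asn: int, local_admin: int) -> bytes:
    """RFC 5668 Four-Octet AS Specific extended community (type 0x02)."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN does not fit in four octets")
    # type, sub-type, 4-octet global administrator, 2-octet local administrator
    return struct.pack("!BBIH", 0x02, RT_SUBTYPE, asn, local_admin)


# Example values are arbitrary private-use ASNs.
print(rt_two_octet_as(64512, 100).hex())
print(rt_four_octet_as(4200000000, 100).hex())
```

Both forms are eight octets; the difference is how the six value octets are split between the AS number and the local administrator field.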
We may want to mention this bleed-over of GW routes due to misconfiguration in Section 8 (Security Considerations):
Note that if a GW is (mis)configured with a different SR domain
identifier from the other GWs to the same domain then it will not be
auto-discovered by the other GWs (and will not auto-discover the
other GWs). This would result in a GW for another site
receiving only the Tunnel Encapsulation attribute included in the BGP
best route; i.e., the Tunnel Encapsulation attribute of the
(mis)configured GW
or that of the other GWs.
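The effect of a misconfigured identifier can be sketched in a few lines of Python. This is a toy model under my own assumptions: AutoDiscoveryRoute and discovered_gws are hypothetical names, and the identifiers "site-A" and "site-B" are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoDiscoveryRoute:
    """Hypothetical model of a GW auto-discovery route."""
    originator: str
    route_target: str  # carries the SR domain identifier


def discovered_gws(local_id, routes):
    """GWs whose auto-discovery route passes the import filter for local_id."""
    return sorted(r.originator for r in routes if r.route_target == local_id)


routes = [
    AutoDiscoveryRoute("GW1", "site-A"),
    AutoDiscoveryRoute("GW2", "site-A"),
    AutoDiscoveryRoute("GW3", "site-B"),  # same site, wrong identifier configured
]

print(discovered_gws("site-A", routes))  # view of GW1 and GW2
print(discovered_gws("site-B", routes))  # view of the misconfigured GW3
```

GW1 and GW2 discover each other but not GW3, and GW3 discovers only itself, so the unions they advertise differ, which is the situation the note above describes.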
There may be significant propagation delays before re-advertisements converge when the set of active GWs changes, particularly where the number of ASes is very large, as over the public Internet. Maybe that should be mentioned.
Kind Regards
Gyan
On Tue, May 18, 2021 at 4:49 PM John E Drake <jdrake@xxxxxxxxxxx> wrote:
Excellent, thanks so much for your help on this.
Yours Irrespectively,
John
> -----Original Message-----
> From: Gyan Mishra <hayabusagsm@xxxxxxxxx>
> Sent: Tuesday, May 18, 2021 4:28 PM
> To: Lars Eggert <lars@xxxxxxxxxx>
> Cc: General Area Review Team <gen-art@xxxxxxxx>; bess@xxxxxxxx; draft-ietf-
> bess-datacenter-gateway.all@xxxxxxxx; last-call@xxxxxxxx
> Subject: Re: [Last-Call] Genart last call review of draft-ietf-bess-datacenter-
> gateway-10
>
>
> Hi Lars & DC Gateway authors
>
> I will be responding back today to the Gen-Art original email I sent with final
> comments and hope the final comments will help improve the document.
>
> I will also address the comments from John Scudder related to GW failover as
> well as Alvaro’s comments related to tunnel encapsulation attribute BGP prefix
> sid Sub-TLV limitations. Also will add new text recommendations related to RFC
> 2119 MUST / SHOULD language to help improve the document.
>
>
> Thank you
>
> Gyan
> On Tue, May 18, 2021 at 3:31 AM Lars Eggert <lars@xxxxxxxxxx> wrote:
>
> > Gyan, thank you for your review and thank you all for the following
> > discussion. I have entered a No Objection ballot for this document
> > based on the current status of the discussion.
> >
> > Lars
> >
> >
> > > On 2021-4-29, at 8:46, Gyan Mishra via Datatracker
> > > <noreply@xxxxxxxx>
> > wrote:
> > >
> > > Reviewer: Gyan Mishra
> > > Review result: Not Ready
> > >
> > > I am the assigned Gen-ART reviewer for this draft. The General Area
> > > Review Team (Gen-ART) reviews all IETF documents being processed by
> > > the IESG for the IETF Chair. Please treat these comments just like
> > > any other last call comments.
> > >
> > > > For more information, please see the FAQ at
> > > >
> > > > <https://trac.ietf.org/trac/gen/wiki/GenArtfaq>.
> > >
> > > Document: draft-ietf-bess-datacenter-gateway-??
> > > Reviewer: Gyan Mishra
> > > Review Date: 2021-04-28
> > > IETF LC End Date: 2021-04-29
> > > IESG Telechat date: Not scheduled for a telechat
> > >
> > > Summary:
> > > This document defines a mechanism using the BGP Tunnel Encapsulation
> > > attribute to allow each gateway router to advertise the routes to the
> > > prefixes in the Segment Routing domains to which it provides access,
> > > and also to advertise on behalf of each other gateway to the same
> > > Segment Routing domain.
> > >
> > > > This draft needs to provide more clarity regarding the use case, where
> > > > this would be used, and how it would be implemented. From reading the
> > > > specification, it appears that some technical gaps exist. There are
> > > > some major issues with this draft. I don’t think this draft is ready yet.
> > >
> > > Major issues:
> > >
> > > > Abstract comments:
> > > > The use of Segment Routing within the Data Center is mentioned
> > > > throughout the draft. Is that a requirement for this specification to
> > > > work? Technically, I would think that discovery of the gateways is
> > > > feasible without requiring SR within the Data Center.
> > > >
> > > > The concept of load balancing is a bigger issue: this draft presents it
> > > > as the problem statement and as what the draft is trying to solve. I
> > > > will address it in the introduction comments.
> > >
> > > > Introduction comments:
> > > > In the introduction the use case is expanded much further, to any
> > > > functional edge AS, in the verbiage below.
> > >
> > > OLD
> > >
> > > “SR may also be operated in other domains, such as access networks.
> > > Those domains also need to be connected across backbone networks
> > > through gateways. For illustrative purposes, consider the Ingress
> > > and Egress SR Domains shown in Figure 1 as separate ASes. The
> > > various ASes that provide connectivity between the Ingress and Egress
> > > Domains could each be constructed differently and use different
> > > technologies such as IP, MPLS with global table routing native BGP to
> > > the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN”
> > >
> > > > This paragraph expands the use case to any ingress or egress stub
> > > > domain: Data Center, Access, or any other. If that is the case, should
> > > > the draft name change to something like “stub edge domain services
> > > > discovery”? As this draft can be used for any of these, I would not
> > > > preclude any use case; I would make the GW discovery open to any
> > > > service GW edge function and change the draft name to something more
> > > > appropriate.
> > >
> > > > This paragraph also says “for illustrative purposes”, which is fine, but
> > > > then it expands the overlay/underlay use cases. I believe this mechanism
> > > > can only be used with a technology that has an overlay/underlay, which
> > > > would preclude any use case with just underlay global table routing,
> > > > such as the one mentioned: “IP, MPLS with global table routing native
> > > > BGP to the edge”. IP or global table routing would be an issue because
> > > > this specification requires setting an RT and an export/import RT
> > > > policy for the discovery of routes advertised by the GWs. As I don’t
> > > > think this solution would work technically for global table routing,
> > > > from what I can tell, I will update the above paragraph to preclude
> > > > global table routing. We can add it back in if we can figure that out,
> > > > but I don’t think any public or private operator would make the drastic
> > > > change from a global table carrying all BGP prefixes in the underlay to
> > > > a VPN overlay, pushing all the any-any prefixes into the overlay, which
> > > > would be a prerequisite to be able to use this draft.
> > >
> > > > From this point forward I am going to assume we are using VPN overlay
> > > > technology such as SR or MPLS.
> > >
> > > NEW
> > >
> > > “SR may also be operated in other domains, such as access networks.
> > > Those domains also need to be connected across backbone networks
> > > through gateways. For illustrative purposes, consider the Ingress
> > > and Egress SR Domains shown in Figure 1 as separate ASes. The
> > > > various ASes that provide connectivity between the Ingress and Egress
> > > > Domains could be two as shown in Figure 1 or could be many more, as
> > > > exists with the public internet use case, and each may be constructed
> > > > differently and use different technologies such as MPLS IP VPN,
> > > > SR-MPLS IP VPN, or SRv6 IP VPN” with a “BGP Free” Core.
> > >
> > > > This may work without a “BGP Free” core, but to reduce the design
> > > > complexity I think we should constrain this to a “BGP Free” core
> > > > transport layer. SR-TE path steering also gets much more complicated
> > > > if all P routers are running BGP. I think we can even explicitly say
> > > > that this example shows the public internet, as that would be one of
> > > > the primary use cases.
> > >
> > > > This paragraph is confusing to the reader.
> > > >
> > > > As a precursor to this paragraph, I think it may be a good idea to state
> > > > that we are talking about either global table IP-only routing or VPN
> > > > overlay technology with SR/MPLS underlay transport. That will make this
> > > > section much easier to understand.
> > >
> > > > In the Figure 1 drawing, you should give an AS number to both the
> > > > ingress domain and the egress domain so the reader does not have to
> > > > guess whether the egress or ingress domain is connected via iBGP or
> > > > eBGP, and state eBGP in the text below. Let's also call out that the
> > > > intermediate ASNs in the middle of the diagram could be two, as shown
> > > > illustratively, but could be many operator domains, as in the case of
> > > > traversing the public internet. In the drawing I would replace ASBR
> > > > with PE, since I am stating that this solution has to be a VPN overlay
> > > > paradigm and not global routing. Also, in the VPN overlay scenario,
> > > > inter-AS peering of any type is almost always between PEs and not a
> > > > separate dedicated device serving a special “ASBR-ASBR” function, as
> > > > the PE acts as the border node providing the “ASBR” type function. So
> > > > in the rewrite I am assuming the drawing has been updated, changing
> > > > ASBR to PE. Let's give each node a number so that we can be clear in
> > > > the text exactly which node we are referring to. In the drawing,
> > > > please show that GW1 peers with PE1, GW2 peers with PE2, and GW3 peers
> > > > with PE3. GW3 also peers with GW4, and GW2 peers with GW5; GW4 and GW5
> > > > are part of AS3. In the AS1-AS2 peering, the top peering would be PE6
> > > > to PE8 and the bottom peering PE7 to PE9. So PE6 and PE7 are in AS1,
> > > > and PE8 and PE9 are in AS2. I made the bottom two ASBRs in AS3 the
> > > > nodes for selective deterministic load balancing, now calling them GW4
> > > > and GW5, which are used later in the problem statement.
> > >
> > > > One major problem with this problem statement is that it is incorrect
> > > > in saying that GW load balancing does not work today in the topology
> > > > given in Figure 1. Edge GW load balancing is based on the iBGP
> > > > tie-breaker in BGP path selection, which is the lowest IGP underlay
> > > > metric. As long as the metric is equal and iBGP multipath is enabled,
> > > > you can load balance to the egress PE1 and PE2 endpoints. So in this
> > > > case flows coming from AS1 into AS2 hit an intermediate P router that
> > > > has iBGP multipath enabled and has, let's say, an equal-cost route to
> > > > the next hop attribute. Assuming next-hop-self is set, the cost to
> > > > loopback0 on PE1 and the cost to loopback0 on PE2 are both, let's say,
> > > > 10, so now you have a BGP multipath. What is required, though, is a
> > > > unique RD in a “BGP Free” core RR environment where all PEs peer to the
> > > > RR as route-reflector clients: for the RR to reflect all the paths
> > > > advertised to it to all the egress PE edges, the RD must be unique.
> > > > BGP add-paths is only used if you have Primary and Backup routing set
> > > > up, where PE1-GW1 has a 0x prepend and PE2-GW2 has a 1x prepend, so
> > > > that BGP add-paths along with BGP PIC Edge gives you a pre-programmed
> > > > backup path at the edge. So add-paths is not necessarily something that
> > > > helps with load balancing and is in fact orthogonal to it, as it is for
> > > > Primary/Backup routing and not Active/Active load-balanced routing.
> > > > Load balancing with a VPN overlay is simply achieved with a unique RD
> > > > per PE, iBGP multipath, and equal-cost paths to the underlay recursive
> > > > IGP-learned next hop attribute, in this case the PE loopback0, per the
> > > > next hop rewrite via “next-hop-self” done on the PE-RR peering in a
> > > > standard VPN overlay topology. What I have stated about load balancing
> > > > being accomplished in the underlay is independent of SR-TE; however,
> > > > with an SR-TE candidate path, the ECMP load-balancing spray to the
> > > > egress PE and egress GW AS can also happen with a prefix-SID.
> > >
> > > OLD
> > > Suppose that there are two gateways, GW1 and GW2 as shown in
> > > Figure 1, for a given egress SR domain and that they each advertise a
> > > route to prefix X which is located within the egress SR domain with
> > > each setting itself as next hop. One might think that the GWs for X
> > > could be inferred from the routes' next hop fields, but typically it
> > > is not the case that both routes get distributed across the backbone:
> > > rather only the best route, as selected by BGP, is distributed. This
> > > precludes load balancing flows across both GWs.
> > >
> > > > I am rewriting the text in NEW because there is some discrepancy about
> > > > which routes are distributed across the backbone. I am completely
> > > > rewriting it to make clearer what we are trying to state here, as the
> > > > text appears to be technically incorrect. I will use the BGP route flow
> > > > to depict the routing and to get to the problem statement we are trying
> > > > to portray.
> > >
> > > NEW
> > >
> > > > Suppose that there are two gateways, GW1 and GW2 as shown in
> > > > Figure 1, for a given egress SR domain, and each gateway advertises a
> > > > VPN prefix X via EBGP to the AS2 core domain with the underlay next hop
> > > > set to GW1 or GW2. In this case we are Active/Active load balancing:
> > > > PE1 and PE2 receive the VPN prefix and advertise VPN prefix X into the
> > > > domain with next-hop-self set on the PE-RR peering to the PE’s
> > > > loopback0. The P routers within the domain have an ECMP path, with an
> > > > IGP metric tie, to egress PE1 and egress PE2 for VPN prefix X learned
> > > > from GW1 and GW2. An SR-TE path can now be stitched from GW3 to PE3,
> > > > SR-TE Segment-1 to PE3 to PE6 and PE7, Segment-2 to PE8 and PE9, to the
> > > > Egress Domain via PE1 and PE2 to GW1 and GW2. In this case, however, we
> > > > don’t want the traffic to be steered via SR-TE load balanced through
> > > > ingress GW3; we want to take GW3 out of rotation and load balance
> > > > traffic to GW4 and GW5 instead.
> > >
> > > > **The text above provides the updated selective deterministic gateway
> > > > steering described below to achieve the goal. I think that may have
> > > > been the intent of the authors and I am just making it more clear.**
> > > >
> > > > As for the problem statement: GW load balancing can easily occur in the
> > > > underlay as stated, so that is not the problem.
> > >
> > > > To my mind, the problem statement that we want to describe in both the
> > > > Abstract and Introduction is not plain gateway load balancing but
> > > > rather a predictable, deterministic method of selecting the gateways
> > > > to be used. That is, each VPN prefix now has a descriptor attached: the
> > > > tunnel encapsulation attribute, which contains multiple TLVs, one or
> > > > more for each “selected gateway”, with each tunnel TLV containing an
> > > > egress tunnel endpoint sub-TLV that identifies the gateway for the
> > > > tunnel. Maybe the sub-TLV could carry a priority field giving the
> > > > pecking order of which GWs are pushed up into the GW hash selected for
> > > > the SR-ERO path to be stitched end to end. So let's say you had 10 GWs
> > > > and you break them up into two or more tiers, with maybe gateways 1-5
> > > > as primary and 6-10 as backup, which could be due to various reasons.
> > > > You can then basically pick and choose, based on priority, which GWs
> > > > get added to the GW hash.
> > >
> > > > I have some feedback and comments on the solution and how best to write
> > > > the verbiage to make it clearer to the reader.
> > > >
> > > > Regarding the RT to attach for GW auto-discovery in the solution: with
> > > > this new RT we are essentially creating a new VPN RIB that has prefixes
> > > > from all the selected gateways that are discovered from the tunnel
> > > > encapsulation attribute TLV.
> > >
> > > > What is really confusing in the text here is whether the tunnel
> > > > encapsulation attribute is being attached to the underlay recursive
> > > > route to the next hop attribute or to the VPN overlay prefix. The
> > > > reason I think it is being attached to the VPN overlay prefix and not
> > > > the underlay next hop attribute is this: how would you otherwise
> > > > create another transport RIB? And if you are creating a new transport
> > > > RIB, there is already a draft by Kaliraj Vairavakkalai, as well as
> > > > BGP-LU SAFI 4 labeled unicast, which exists today, to advertise next
> > > > hops between domains for an end-to-end load-balanced LSP path.
> > >
> > >
> > > > https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07
> > >
> > > IANA code point below
> > > 76 Classful-Transport SAFI
> > > [draft-kaliraj-idr-bgp-classful-transport-planes-00]
> > >
> > > > Also in line with CT, another option is BGP-LU SAFI 4 to import the
> > > > loopbacks between domains; the loopback is the next hop attribute to
> > > > be advertised into the core for the end-to-end LSP. So the BGP-LU SAFI
> > > > RIB could be used for the GW next hop advertisement between domains,
> > > > so that there is visibility of all the egress PE loopback0 addresses
> > > > between domains. You can either stitch a segmented LSP, like inter-AS
> > > > option B with SR-TE stitching, and use the next-hop-self PE-RR next
> > > > hop rewrite on each of the PEs within the internet domain, or you
> > > > could import all the PE loopbacks from all ingress and egress domains
> > > > into the internet domain, similar to inter-AS option C, to create an
> > > > end-to-end LSP and instantiate an end-to-end SR-TE path.
> > >
> > > > Maybe you could attach the RT and the tunnel encapsulation attribute
> > > > (tunnel TLV, endpoint sub-TLV) to the VPN overlay prefix. I am not
> > > > sure how that would be beneficial, since the underlay steers the VPN
> > > > overlay.
> > > >
> > > > Coupling the new VPN overlay GW RIB RT to the transport underlay CT
> > > > class RIB or BGP-LU RIB may have some benefit, but that would have to
> > > > be investigated, and I think it is out of scope of the goals of this
> > > > draft.
> > >
> > > > I think we first have to figure out what the authors' goal and purpose
> > > > for this draft are, and how GW discovery should work in light of the
> > > > CT class RIB AFI/SAFI code point draft that exists today, as well as
> > > > the BGP-LU option for next hop advertisement within the internet
> > > > domain.
> > >
> > > Section 3 comments
> > >
> > > “Each GW is configured with an identifier for the SR domain. That
> > > identifier is common across all GWs to the domain (i.e., the same
> > > identifier is used by all GWs to the same SR domain), and unique
> > > across all SR domains that are connected (i.e., across all GWs to
> > > all SR domains that are interconnected).
> > >
> > > **No issues with the above**
> > >
> > > A route target ([RFC4360]) is attached to each GW's auto-discovery
> > > route and has its value set to the SR domain identifier.
> > >
> > > > **So here, if the RT is attached to the GW auto-discovery route, we
> > > > need to state that this is the underlay route, and that the PE does a
> > > > next-hop-self rewrite of the next hop on the eBGP link to the BGP
> > > > egress domain to its loopback0, so that the GW next hop we are
> > > > tracking for all the ingress and egress PE domains is the egress and
> > > > ingress PE loopback0.**
> > >
> > > Each GW constructs an import filtering rule to import any route
> > > that carries a route target with the same SR domain identifier
> > > that the GW itself uses. This means that only these GWs will
> > > import those routes, and that all GWs to the same SR domain will
> > > import each other's routes and will learn (auto-discover) the
> > > current set of active GWs for the SR domain.”
> > >
> > > > **So if this is the case, and we are tracking the underlay RIB and
> > > > attaching a route target to all the ingress PE & P next hops, which
> > > > are loopback0, then this is literally identical to BGP-LU importing
> > > > all the loopbacks between domains, or to using CT class. There is no
> > > > need for this feature to use the tunnel encapsulation attribute. I am
> > > > not following why you would not use BGP-LU or the CT class RIB.**
> > >
> > > “To avoid the side effect of applying the Tunnel Encapsulation
> > > attribute to any packet that is addressed to the GW itself, the GW
> > > SHOULD use a different loopback address for packets intended for it.”
> > >
> > > > **I don’t understand this statement, as the next hop being tracked for
> > > > gateway load balancing is the ingress and egress PE loopback0. The
> > > > subnet between the GW and the PE is not advertised into the internet
> > > > domain, because we do next-hop-self on the PE's PE-RR iBGP peering.**
> > > > Looking at it a second time, I think the idea here is a BGP-LU
> > > > inter-AS option C style import of loopbacks between domains: instead
> > > > of importing the loopback0 that carries all packets on the GW device,
> > > > use a different loopback on GW1 so that it does not carry the FEC of
> > > > all BAU packets, similar to the "per-vrf TE" concept utilized in
> > > > RSVP-TE to VPN mapping
BESS Working Group                                             A. Farrel
Internet-Draft                                        Old Dog Consulting
Intended status: Standards Track                                J. Drake
Expires: October 17, 2021                                       E. Rosen
                                                        Juniper Networks
                                                                K. Patel
                                                            Arrcus, Inc.
                                                                L. Jalil
                                                                 Verizon
                                                          April 15, 2021


   Gateway Auto-Discovery and Route Advertisement for Segment Routing
                     Enabled Domain Interconnection
                 draft-ietf-bess-datacenter-gateway-10

Abstract

   Data centers are critical components of the infrastructure used by
   network operators to provide services to their customers.  Data
   centers are attached to the Internet or a backbone network by gateway
   routers.  One data center typically has more than one gateway for
   commercial, load balancing, and resiliency reasons.

   Segment Routing is a protocol mechanism that can be used within a
   data center, and also for steering traffic that flows between two
   data center sites.  In order that one data center site may load
   balance the traffic it sends to another data center site, it needs to
   know the complete set of gateway routers at the remote data center,
   the points of connection from those gateways to the backbone network,
   and the connectivity across the backbone network.

   Segment Routing may also be operated in other sites, such as access
   networks.  Those sites also need to be connected across backbone
   networks through gateways.

   This document defines a mechanism using the BGP Tunnel Encapsulation
   attribute to allow each gateway router to advertise the underlay
   transport routes for internal prefixes at the site to which it
   provides access, and also to advertise on behalf of each other, all
   the gateways within the site to the same Segment Routing domain.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

Farrel, et al.          Expires October 17, 2021                [Page 1]

Internet-Draft               SR DC Gateways                   April 2021

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 17, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  SR Domain Gateway Auto-Discovery
   4.  Relationship to BGP Link State and Egress Peer Engineering
   5.  Advertising an SR Domain Route Externally
   6.  Encapsulation
   7.  IANA Considerations
   8.  Security Considerations
   9.  Manageability Considerations
     9.1.  Relationship to Route Target Constraint
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Data centers (DCs) are critical components of the infrastructure used
   by network operators to provide services to their customers.  DCs are
   attached to the Internet or a backbone network by gateway routers
   (GWs).  One DC typically has more than one GW for various reasons
   including commercial preferences, load balancing, or resiliency
   against connection or device failure.

   Segment Routing (SR) [RFC8402] is a protocol mechanism that can be
   used within a DC, and also for steering traffic that flows between
   two DC sites.  In order for a source (ingress) DC that uses SR to
   load balance the flows it sends to a destination (egress) DC, it
   needs to know the complete set of entry nodes (i.e., GWs) for that
   egress DC from the backbone network connecting the two DCs.  Note
   that it is assumed that the connected set of DCs and the backbone
   network connecting them are part of the same SR BGP Link State (LS)
   instance ([RFC7752] and [I-D.ietf-idr-bgpls-segment-routing-epe]) so
   that traffic engineering using SR may be used for these flows.  Note
   that the ingress and egress DC sites do not need to support SR.

   SR may also be operated in other sites, such as access networks.
   Those domains also need to be connected across backbone networks
   through gateways.  For illustrative purposes, consider the Ingress
   and Egress SR Sites shown in Figure 1 as separate ASes.  The various
   ASes that provide connectivity between the Ingress and Egress sites
   could each be constructed differently and use different technologies
   such as IP, MPLS with global table routing native BGP to the edge,
   MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN.

   Suppose that there are two gateways, GW1 and GW2 as shown in
   Figure 1, for a given egress SR site and that they each advertise a
   route to prefix X which is located within the egress SR site with
   each setting itself as next hop.  One might think that the GWs for X
   could be inferred from the routes' next hop fields, but typically it
   is not the case that both routes get distributed across the backbone:
   rather only the best route, as selected by BGP, is distributed.  This
   precludes load balancing flows across both GWs.

    -----------------                    ---------------------
   | Ingress         |                  | Egress    ------    |
   | SR Site         |                  | SR Site  |Prefix|   |
   |                 |                  |          |   X  |   |
   |                 |                  |           ------    |
   |       --        |                  |   ---          ---  |
   |      |GW|       |                  |  |GW1|        |GW2| |
    -------++--------                    ----+-----------+-+--
           | \                               |          /  |
           |  \                              |         /   |
           |  -+-------------        --------+--------+--  |
           | ||ASBR|     ----|      |----  |ASBR| |ASBR| | |
           | | ----     |ASBR+------+ASBR|  ----   ----  | |
           | |           ----|      |----                | |
           | | SR Domain     |      |      SR Domain     | |
           | |           ----|      |----                | |
           | | AS1      |ASBR+------+ASBR|        AS2    | |
           | |           ----|      |----                | |
           |  ---------------        --------------------  |
         --+-----------------------------------------------+--
        | |ASBR|                                       |ASBR| |
        |  ----                   AS3                   ----  |
        |                                                     |
         -----------------------------------------------------

      Figure 1: Example Segment Routing Domain Interconnection

   The obvious solution to this problem is to use the BGP feature that
   allows the advertisement of multiple paths in BGP (known as
   Add-Paths) [RFC7911] to ensure that all routes to X get advertised by
   BGP.  However, even if this is done, the identity of the GWs will be
   lost as soon as the routes get distributed through an Autonomous
   System Border Router (ASBR) that will set itself to be the next hop.
   And if there are multiple Autonomous Systems (ASes) in the backbone,
   not only will the next hop change several times, but the Add-Paths
   technique will experience scaling issues.
This all means that the Add-Paths approach is limited to SR domains connected over a single AS. Please refer to Section 2 of [I-D.farrel-spring-sr-domain-interconnect] for the problem statement details.

This document defines a solution that overcomes this limitation and works equally well with a backbone constructed from one or more ASes using the Tunnel Encapsulation attribute [I-D.ietf-idr-tunnel-encaps] as follows:

   When a GW to a given SR site advertises a route to a prefix X within that SR site, it will include a Tunnel Encapsulation attribute that contains the union of the Tunnel Encapsulation attributes advertised by each of the GWs to that SR site, including itself.

In other words, each route advertised by a GW identifies all of the GWs to the same SR site (see Section 3 for a discussion of how GWs discover each other). That is, the Tunnel Encapsulation attribute advertised by each GW contains multiple Tunnel TLVs, one or more from each active GW, and each Tunnel TLV MUST contain a Tunnel Egress Endpoint Sub-TLV that identifies the GW for that Tunnel TLV. Therefore, even if only one of the routes is distributed to other ASes, it will not matter how many times the next hop changes, as the Tunnel Encapsulation attribute will remain unchanged.

To put this in the context of Figure 1, GW1 and GW2 discover each other as gateways for the egress SR site. Both GW1 and GW2 advertise themselves as having routes to prefix X. Furthermore, GW1 includes a Tunnel Encapsulation attribute which is the union of its Tunnel Encapsulation attribute and GW2's Tunnel Encapsulation attribute. Similarly, GW2 includes a Tunnel Encapsulation attribute which is the union of its Tunnel Encapsulation attribute and GW1's Tunnel Encapsulation attribute.
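The union described above can be illustrated with a small sketch. This is not the wire encoding from [I-D.ietf-idr-tunnel-encaps]: the TunnelTLV class, the string tunnel type, and the documentation-range addresses are hypothetical stand-ins chosen only to show that every GW ends up advertising the same set of Tunnel Egress Endpoints.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TunnelTLV:
    """Hypothetical, simplified stand-in for one Tunnel TLV."""
    tunnel_type: str      # illustrative label only, not a registered codepoint
    egress_endpoint: str  # models the Tunnel Egress Endpoint Sub-TLV (GW address)


def union_attribute(per_gw_tlvs: Dict[str, List[TunnelTLV]]) -> List[TunnelTLV]:
    """Merge the Tunnel TLVs of every active GW into one attribute."""
    merged: List[TunnelTLV] = []
    for gw in sorted(per_gw_tlvs):          # deterministic order for the example
        for tlv in per_gw_tlvs[gw]:
            if tlv not in merged:           # a union carries no duplicate TLVs
                merged.append(tlv)
    return merged


def egress_gateways(attribute: List[TunnelTLV]) -> List[str]:
    """What a remote GW can recover from any single route it receives."""
    return sorted({tlv.egress_endpoint for tlv in attribute})


# GW1 and GW2 of the egress site in Figure 1 (addresses are assumptions).
site = {
    "GW1": [TunnelTLV("MPLS", "192.0.2.1")],
    "GW2": [TunnelTLV("MPLS", "192.0.2.2")],
}
attr = union_attribute(site)
print(egress_gateways(attr))
```

Because GW1 and GW2 both attach the same merged attribute, it does not matter which of their routes BGP selects as best: the receiver still learns both endpoints.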
The gateway in the ingress SR site can now see all possible paths to X in the egress SR site regardless of which route is propagated to it, and it can choose one, or balance traffic flows as it sees fit.

The solution defined in this document can be seen in the broader context of SR domain interconnection in [I-D.farrel-spring-sr-domain-interconnect]. That document shows how other existing protocol elements may be combined with the solution defined in this document to provide a full system, but is not a necessary reference for understanding this document.

2.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3.  SR Domain Gateway Auto-Discovery

To allow a given SR domain's GWs to auto-discover each other and to coordinate their operations, the following procedure MUST be implemented:

o  Each GW that connects to the same SR domain MUST be configured with an SR domain identifier which MUST be identical across all GWs that connect to that domain. The SR domain identifier MUST be unique across all SR domains that are connected (i.e., across all GWs to all SR domains that are interconnected).

o  A route target ([RFC4360]) MUST be attached to each GW's auto-discovery route with its value set to the SR domain identifier.

o  Each GW MUST construct an import filtering rule to import any route that carries a route target with the same SR domain identifier that the GW itself exports.

Only the set of GWs configured with the same SR domain identifier will import the transport-layer underlay auto-discovery route. As a result, all GWs to the same site will import each other's routes and will learn (auto-discover) the current set of active GWs for the site.
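The import rule in the procedure above can be sketched as follows. This is a toy model rather than a BGP implementation: the Gateway and AutoDiscoveryRoute classes, the "site-A" and "site-B" identifiers, and the loopback addresses are all hypothetical names introduced for illustration, and the route target is modelled as a plain string equal to the SR domain identifier.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class AutoDiscoveryRoute:
    """Hypothetical model of a GW auto-discovery route."""
    loopback: str      # NLRI: one of the advertising GW's loopback addresses
    route_target: str  # route target whose value is the SR domain identifier


class Gateway:
    def __init__(self, loopback: str, sr_domain_id: str) -> None:
        self.loopback = loopback
        self.sr_domain_id = sr_domain_id
        self.active_gws: Set[str] = {loopback}  # a GW always knows itself

    def export_route(self) -> AutoDiscoveryRoute:
        # Attach the SR domain identifier as the route target.
        return AutoDiscoveryRoute(self.loopback, self.sr_domain_id)

    def import_route(self, route: AutoDiscoveryRoute) -> bool:
        # Import filter: accept only routes carrying the identifier
        # that this GW itself exports.
        if route.route_target != self.sr_domain_id:
            return False
        self.active_gws.add(route.loopback)
        return True


gw1 = Gateway("192.0.2.1", "site-A")
gw2 = Gateway("192.0.2.2", "site-A")
gw3 = Gateway("198.51.100.1", "site-B")   # GW of a different site

for route in (gw2.export_route(), gw3.export_route()):
    gw1.import_route(route)

print(sorted(gw1.active_gws))
```

In this sketch gw1 learns gw2 but ignores gw3, which mirrors the intent that only GWs sharing an SR domain identifier discover each other.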
The auto-discovery route that each GW advertises consists of the following:

o  An IPv4 or IPv6 NLRI containing one of the GW's loopback addresses (that is, with an AFI/SAFI pair that is one of 1/1, 2/1, 1/4, or 2/4).

o  A Tunnel Encapsulation attribute [I-D.ietf-idr-tunnel-encaps] containing the GW's encapsulation information encoded in one or more Tunnel TLVs.

To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW itself, the GW MUST use a different loopback address for packets intended for it.

As described in Section 1, each GW will include a Tunnel Encapsulation attribute with the GW encapsulation information for each of the site's active GWs (including itself) in every route advertised externally to the site. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW) each externally advertised route will be re-advertised with a new Tunnel Encapsulation attribute which reflects the current set of active GWs.

If a gateway becomes disconnected from the backbone network, or if the SR site operator decides to terminate the gateway's activity, it MUST withdraw the advertisements described above. This means that remote gateways at other sites will stop seeing advertisements from this gateway.

Note that if a GW is (mis)configured with a different SR domain identifier from the other GWs to the same domain then it will not be auto-discovered by the other GWs (and will not auto-discover the other GWs). This would result in a GW for another site receiving only the Tunnel Encapsulation attribute included in the BGP best route; i.e., the Tunnel Encapsulation attribute of the (mis)configured GW or that of the other GWs.

4.  Relationship to BGP Link State and Egress Peer Engineering

When a remote GW receives a route to a prefix X it uses the Tunnel Egress Endpoint Sub-TLVs in the containing Tunnel Encapsulation attribute to identify the GWs through which X can be reached. It uses this information to compute SR Traffic Engineering (SR TE) paths across the backbone network looking at the information advertised to it in SR BGP Link State (BGP-LS) [I-D.ietf-idr-bgp-ls-segment-routing-ext] and correlated using the SR domain identity. SR Egress Peer Engineering (EPE) [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement the information advertised in BGP-LS.

5.  Advertising an SR Domain Route Externally

When a packet destined for prefix X is sent on an SR TE path to a GW for the site containing X (that is, the packet is sent in the Ingress site on an SR TE path that describes the whole path, including the part within the Egress site), it needs to carry the receiving GW's label for X such that this label rises to the top of the stack before the GW completes its processing of the packet. To achieve this, each Tunnel TLV in the Tunnel Encapsulation attribute contains a Prefix SID sub-TLV [I-D.ietf-idr-tunnel-encaps] for X.

Alternatively, if the GWs for a given site are configured to allow remote GWs to perform SR TE through that SR domain for a prefix X, then each GW computes an SR TE path through that site to X from each of the currently active GWs, and places each in an MPLS label stack sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR Tunnel TLV for that GW.

Please refer to Section 7 of [I-D.farrel-spring-sr-domain-interconnect] for worked examples of how the label stack is constructed in this case, and how the advertisements would work.

6.  Encapsulation

If the GWs for a given site are configured to allow remote GWs to send them a packet in that site's native encapsulation, then each GW will also include multiple instances of a Tunnel TLV for that native encapsulation in externally advertised routes: one for each GW and each containing a Tunnel Egress Endpoint sub-TLV with that GW's address. A remote GW may then encapsulate a packet according to the rules defined via the sub-TLVs included in each of the Tunnel TLVs.

7.  IANA Considerations

IANA maintains a registry called "Border Gateway Protocol (BGP) Parameters" with a sub-registry called "BGP Tunnel Encapsulation Attribute Tunnel Types." The registration policy for this registry is First-Come First-Served [RFC8126].

IANA previously assigned the value 17 from this sub-registry for "SR Tunnel", referencing this document. IANA is now requested to mark that assignment as deprecated. IANA may reclaim that codepoint at such a time that the registry is depleted.

8.  Security Considerations

From a protocol point of view, the mechanisms described in this document can leverage the security mechanisms already defined for BGP. Further discussion of security considerations for BGP may be found in the BGP specification itself [RFC4271] and in the security analysis for BGP [RFC4272]. The TCP Authentication Option, which replaces the earlier TCP MD5 signature option for protecting BGP sessions, is specified in [RFC5925], while [RFC6952] includes an analysis of BGP keying and authentication issues.

The mechanisms described in this document involve sharing routing or reachability information between domains: that may mean disclosing information that is normally contained within a domain. So it needs to be understood that normal security paradigms based on the boundaries of domains are weakened.
Discussion of these issues with respect to VPNs can be found in [RFC4364], while [RFC7926] describes many of the issues associated with the exchange of topology or TE information between domains.

Particular exposures resulting from this work include:

o  Gateways to an SR domain will know about all other gateways to the same domain. This feature applies within a domain and so is not a substantial exposure, but it does mean that if the BGP exchanges within a domain can be snooped or if a gateway can be subverted then an attacker may learn the full set of gateways to a domain. This would facilitate more effective attacks on that domain.

o  The existence of multiple gateways to a domain becomes more visible across the backbone and even into remote domains. This means that an attacker is able to prepare a more comprehensive attack than is possible when only the locally attached backbone network (e.g., the AS that hosts the domain) can see all of the gateways to a site. For example, a Denial of Service attack on a single GW is mitigated by the existence of other GWs, but if the attacker knows about all the gateways then the whole set can be attacked at once.

o  A node in a domain that does not have external BGP peering (i.e., is not really a domain gateway and cannot speak BGP into the backbone network) may be able to get itself advertised as a gateway by letting other genuine gateways discover it (by speaking BGP to them within the domain) and so may get those genuine gateways to advertise it as a gateway into the backbone network. This would allow the malicious node to attract traffic without having to have secure BGP peerings with out-of-domain nodes.

o  If it is possible to modify a BGP message within the backbone, it may be possible to spoof the existence of a gateway. This could cause traffic to be attracted to a specific node and might result in black-holing of traffic.
All of the issues in the list above could cause disruption to domain interconnection, but are not new protocol vulnerabilities so much as new exposures of information that SHOULD be protected against using existing protocol mechanisms. Furthermore, it is a general observation that if these attacks are possible then it is highly likely that far more significant attacks can be made on the routing system. It should be noted that BGP peerings are not discovered, but always arise from explicit configuration.

9.  Manageability Considerations

The principal configuration item added by this solution is the allocation of an SR domain identifier. The same identifier MUST be assigned to every GW to the same domain, and each domain MUST have a different identifier. This requires coordination, probably through a central management agent.

It should be noted that BGP peerings are not discovered, but always arise from explicit configuration. This is no different from any other BGP operation.

9.1.  Relationship to Route Target Constraint

In order to limit the VPN routing information that is maintained at a given route reflector, [RFC4364] suggests the use of "Cooperative Route Filtering" [RFC5291] between route reflectors. [RFC4684] defines an extension to that mechanism to include support for multiple autonomous systems and asymmetric VPN topologies such as hub-and-spoke. The mechanism in RFC 4684 is known as Route Target Constraint (RTC).

An operator would not normally configure RTC by default for any AFI/SAFI combination, and would only enable it after careful consideration. When using the mechanisms defined in this document, the operator should consider carefully the effects of filtering routes. In some cases this may be desirable, and in others it could limit the effectiveness of the procedures.

10.  Acknowledgements

Thanks to Bruno Rijsman, Stephane Litkowski, Boris Hassanov, Linda Dunbar, Ravi Singh, and Gyan Mishra for review comments, and to Robert Raszuk for useful discussions.

11.  References

11.1.  Normative References

[I-D.ietf-idr-bgpls-segment-routing-epe]
   Previdi, S., Talaulikar, K., Filsfils, C., Patel, K., Ray, S., and J. Dong, "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", draft-ietf-idr-bgpls-segment-routing-epe-19 (work in progress), May 2019.

[I-D.ietf-idr-tunnel-encaps]
   Patel, K., Velde, G., Sangli, S., and J. Scudder, "The BGP Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-21 (work in progress), January 2021.

[RFC2119]
   Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC4271]
   Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, DOI 10.17487/RFC4271, January 2006, <https://www.rfc-editor.org/info/rfc4271>.

[RFC4360]
   Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, February 2006, <https://www.rfc-editor.org/info/rfc4360>.

[RFC5925]
   Touch, J., Mankin, A., and R. Bonica, "The TCP Authentication Option", RFC 5925, DOI 10.17487/RFC5925, June 2010, <https://www.rfc-editor.org/info/rfc5925>.

[RFC7752]
   Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and S. Ray, "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", RFC 7752, DOI 10.17487/RFC7752, March 2016, <https://www.rfc-editor.org/info/rfc7752>.

[RFC8174]
   Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

11.2.  Informative References

[I-D.farrel-spring-sr-domain-interconnect]
   Farrel, A. and J. Drake, "Interconnection of Segment Routing Domains - Problem Statement and Solution Landscape", draft-farrel-spring-sr-domain-interconnect-05 (work in progress), October 2018.

[I-D.ietf-idr-bgp-ls-segment-routing-ext]
   Previdi, S., Talaulikar, K., Filsfils, C., Gredler, H., and M. Chen, "BGP Link-State extensions for Segment Routing", draft-ietf-idr-bgp-ls-segment-routing-ext-16 (work in progress), June 2019.

[RFC4272]
   Murphy, S., "BGP Security Vulnerabilities Analysis", RFC 4272, DOI 10.17487/RFC4272, January 2006, <https://www.rfc-editor.org/info/rfc4272>.

[RFC4364]
   Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006, <https://www.rfc-editor.org/info/rfc4364>.

[RFC4684]
   Marques, P., Bonica, R., Fang, L., Martini, L., Raszuk, R., Patel, K., and J. Guichard, "Constrained Route Distribution for Border Gateway Protocol/MultiProtocol Label Switching (BGP/MPLS) Internet Protocol (IP) Virtual Private Networks (VPNs)", RFC 4684, DOI 10.17487/RFC4684, November 2006, <https://www.rfc-editor.org/info/rfc4684>.

[RFC5291]
   Chen, E. and Y. Rekhter, "Outbound Route Filtering Capability for BGP-4", RFC 5291, DOI 10.17487/RFC5291, August 2008, <https://www.rfc-editor.org/info/rfc5291>.

[RFC6952]
   Jethanandani, M., Patel, K., and L. Zheng, "Analysis of BGP, LDP, PCEP, and MSDP Issues According to the Keying and Authentication for Routing Protocols (KARP) Design Guide", RFC 6952, DOI 10.17487/RFC6952, May 2013, <https://www.rfc-editor.org/info/rfc6952>.

[RFC7911]
   Walton, D., Retana, A., Chen, E., and J. Scudder, "Advertisement of Multiple Paths in BGP", RFC 7911, DOI 10.17487/RFC7911, July 2016, <https://www.rfc-editor.org/info/rfc7911>.

[RFC7926]
   Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G., Ceccarelli, D., and X. Zhang, "Problem Statement and Architecture for Information Exchange between Interconnected Traffic-Engineered Networks", BCP 206, RFC 7926, DOI 10.17487/RFC7926, July 2016, <https://www.rfc-editor.org/info/rfc7926>.

[RFC8126]
   Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, June 2017, <https://www.rfc-editor.org/info/rfc8126>.

[RFC8402]
   Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, July 2018, <https://www.rfc-editor.org/info/rfc8402>.

Authors' Addresses

   Adrian Farrel
   Old Dog Consulting
   Email: adrian@xxxxxxxxxxxx

   John Drake
   Juniper Networks
   Email: jdrake@xxxxxxxxxxx

   Eric Rosen
   Juniper Networks
   Email: erosen52@xxxxxxxxx

   Keyur Patel
   Arrcus, Inc.
   Email: keyur@xxxxxxxxxx

   Luay Jalil
   Verizon
   Email: luay.jalil@xxxxxxxxxxx