Re: [Last-Call] Genart last call review of draft-ietf-bess-datacenter-gateway-10

John E Drake <jdrake=40juniper.net@xxxxxxxxxxxxxx> · Tue, 18 May 2021 20:49:07 +0000

Excellent, thanks so much for your help on this.

Yours Irrespectively,

John

> -----Original Message-----
> From: Gyan Mishra <hayabusagsm@xxxxxxxxx>
> Sent: Tuesday, May 18, 2021 4:28 PM
> To: Lars Eggert <lars@xxxxxxxxxx>
> Cc: General Area Review Team <gen-art@xxxxxxxx>; bess@xxxxxxxx; draft-ietf-
> bess-datacenter-gateway.all@xxxxxxxx; last-call@xxxxxxxx
> Subject: Re: [Last-Call] Genart last call review of draft-ietf-bess-datacenter-
> gateway-10
> 
> Hi Lars & DC Gateway authors
> 
> I will be replying today to my original Gen-ART review email with final
> comments, and I hope they will help improve the document.
> 
> I will also address John Scudder’s comments on GW failover, as well as
> Alvaro’s comments on the limitations of the tunnel encapsulation
> attribute BGP Prefix-SID sub-TLV.  I will also add new text
> recommendations on RFC 2119 MUST / SHOULD language to help improve the
> document.
> 
> 
> Thank you
> 
> Gyan
> On Tue, May 18, 2021 at 3:31 AM Lars Eggert <lars@xxxxxxxxxx> wrote:
> 
> > Gyan, thank you for your review and thank you all for the following
> > discussion. I have entered a No Objection ballot for this document
> > based on the current status of the discussion.
> >
> > Lars
> >
> >
> > > > On 2021-4-29, at 8:46, Gyan Mishra via Datatracker
> > > > <noreply@xxxxxxxx> wrote:
> > >
> > > Reviewer: Gyan Mishra
> > > Review result: Not Ready
> > >
> > > I am the assigned Gen-ART reviewer for this draft. The General Area
> > > Review Team (Gen-ART) reviews all IETF documents being processed by
> > > the IESG for the IETF Chair.  Please treat these comments just like
> > > any other last call comments.
> > >
> > > For more information, please see the FAQ at
> > >
> > >
> > > > <https://trac.ietf.org/trac/gen/wiki/GenArtfaq>.
> > >
> > > Document: draft-ietf-bess-datacenter-gateway-??
> > > Reviewer: Gyan Mishra
> > > Review Date: 2021-04-28
> > > IETF LC End Date: 2021-04-29
> > > IESG Telechat date: Not scheduled for a telechat
> > >
> > > Summary:
> > >   This document defines a mechanism using the BGP Tunnel Encapsulation
> > >   attribute to allow each gateway router to advertise the routes to the
> > >   prefixes in the Segment Routing domains to which it provides access,
> > >   and also to advertise on behalf of each other gateway to the same
> > >   Segment Routing domain.
> > >
> > > > This draft needs to provide more clarity about the use case: where
> > > > it would be used, and how it would be used and implemented.  From
> > > > reading the specification, it appears that some technical gaps
> > > > exist.  There are some major issues with this draft, and I don’t
> > > > think it is ready yet.
> > >
> > > Major issues:
> > >
> > > Abstract comments:
> > > > The abstract mentions the use of Segment Routing within the Data
> > > > Center.  Is that a requirement for this specification to work,
> > > > given that it is mentioned throughout the draft?  Technically, I
> > > > would think the concept of gateway discovery is feasible without
> > > > requiring SR within the Data Center.
> > >
> > > > Load balancing is a bigger issue: the draft presents it as the
> > > > problem statement and as what the draft is trying to solve.  I
> > > > address it in the introduction comments.
> > >
> > > Introduction comments:
> > > > In the introduction, the use case is expanded much further, to any
> > > > functional edge AS, in the text below.
> > >
> > > OLD
> > >
> > >   “SR may also be operated in other domains, such as access networks.
> > >   Those domains also need to be connected across backbone networks
> > >   through gateways.  For illustrative purposes, consider the Ingress
> > >   and Egress SR Domains shown in Figure 1 as separate ASes.  The
> > >   various ASes that provide connectivity between the Ingress and Egress
> > >   Domains could each be constructed differently and use different
> > >   technologies such as IP, MPLS with global table routing native BGP to
> > >   the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN”
> > >
> > > > This paragraph expands the use case to any ingress or egress stub
> > > > domain: Data Center, Access, or any other.  If that is the case,
> > > > should the draft name change to something like “stub edge domain
> > > > services discovery”?  As this draft can be used for any stub
> > > > domain, I would not preclude any use case.  I would make the GW
> > > > discovery open to any service GW edge function and change the
> > > > draft name to something more appropriate.
> > >
> > > > This paragraph also says “for illustrative purposes”, which is
> > > > fine, but it then expands the overlay/underlay use cases.  I
> > > > believe this solution can only be used with a technology that has
> > > > an overlay/underlay, which would preclude any use case with just
> > > > underlay global table routing, such as the “IP, MPLS with global
> > > > table routing native BGP to the edge” that is mentioned.  IP or
> > > > global table routing would be an issue because this specification
> > > > requires setting an RT and an export/import RT policy for the
> > > > discovery of routes advertised by the GWs.
> > > >
> > > > As I don’t think this solution would work technically for global
> > > > table routing, from what I can tell, I will update the above
> > > > paragraph to preclude global table routing.  We can add it back in
> > > > if we can figure that out, but I don’t think any public or private
> > > > operator would make the drastic change from a global table
> > > > carrying all BGP prefixes in the underlay to a VPN overlay,
> > > > pushing all the any-any prefixes into the overlay, as that would
> > > > be a prerequisite to using this draft.
> > >
> > > > From this point forward I am going to assume we are using VPN
> > > > overlay technology such as SR or MPLS.
> > >
> > > NEW
> > >
> > >   “SR may also be operated in other domains, such as access networks.
> > >   Those domains also need to be connected across backbone networks
> > >   through gateways.  For illustrative purposes, consider the Ingress
> > >   and Egress SR Domains shown in Figure 1 as separate ASes.  The
> > > >   various ASes that provide connectivity between the Ingress and
> > > >   Egress Domains could be two as shown in Figure 1 or could be many
> > > >   more, as exists with the public internet use case, and each may
> > > >   be constructed differently and use different technologies such
> > > >   as MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN” with a “BGP Free”
> > > >   Core.
> > >
> > > > This may work without a “BGP Free” core, but to reduce the design
> > > > complexity I think we should constrain it to a “BGP Free” core
> > > > transport layer.  SR-TE path steering also gets much more
> > > > complicated if all P routers are running BGP.  I think we can even
> > > > say explicitly that this example shows the public internet, as
> > > > that would be one of the primary use cases.
> > >
> > > > This paragraph is confusing to the reader.
> > > >
> > > > As a precursor to this paragraph, I think it may be a good idea to
> > > > state which we are talking about: global table IP-only routing, or
> > > > VPN overlay technology with SR/MPLS underlay transport.  That will
> > > > make this section much easier to understand.
> > >
> > > > In the Figure 1 drawing, you should give an AS number to both the
> > > > ingress domain and the egress domain, so the reader does not have
> > > > to guess whether the connection to the egress or ingress domain is
> > > > iBGP or eBGP, and state eBGP in the text below.  Let’s also state
> > > > that the intermediate ASes in the middle of the diagram could be
> > > > two, as shown illustratively, but could be many operator domains,
> > > > as in the case of traversing the public internet.
> > > >
> > > > In the drawing I would replace ASBR with PE, since I am stating
> > > > that this solution has to be a VPN overlay paradigm and not global
> > > > routing.  Also, in the VPN overlay scenario, any type of inter-AS
> > > > peering is almost always between PEs and not a separate dedicated
> > > > device serving a special “ASBR-ASBR” function, as the PE acts as
> > > > the border node providing the “ASBR” type function.  So in the
> > > > rewrite I am assuming the drawing has been updated, changing ASBR
> > > > to PE.
> > > >
> > > > Let’s give each node a number so that we can be clear in the text
> > > > exactly which node we are referring to.  In the drawing, please
> > > > show that GW1 peers with PE1, GW2 peers with PE2, and GW3 peers
> > > > with PE3.  GW3 also peers with GW4, and GW2 peers with GW5; GW4
> > > > and GW5 are part of AS3.  In the AS1-AS2 peering, the top peering
> > > > would be PE6 to PE8 and the bottom peering PE7 to PE9.  So PE6 and
> > > > PE7 are in AS1, and PE8 and PE9 are in AS2.  I made the bottom two
> > > > ASBRs in AS3 the gateways for the selective deterministic load
> > > > balancing, now calling them GW4 and GW5; they are used later in
> > > > the problem statement.
> > >
> > > > One major problem with this problem statement is that it is
> > > > incorrect in saying that GW load balancing does not work today in
> > > > the topology given in Figure 1.  Edge GW load balancing is based
> > > > on the iBGP path-selection tie-breaker, which is the lowest IGP
> > > > underlay metric: as long as the metric is equal and you have iBGP
> > > > multipath enabled, you can load balance to the egress PE1 and PE2
> > > > endpoints.  So in this case, flows coming from AS1 into AS2 hit an
> > > > intermediate P router that has iBGP multipath enabled and has,
> > > > let’s say, equal-cost routes to the next hops.  Assuming
> > > > next-hop-self is set, the cost to loopback0 on PE1 and the cost to
> > > > loopback0 on PE2 are both, let’s say, 10, so now you have BGP
> > > > multipath.
> > > >
> > > > What is required, though, is a unique RD per PE.  In a “BGP Free”
> > > > core RR environment, where all PEs are route-reflector clients
> > > > peering with the RR, the RD must be unique for the RR to reflect
> > > > all of the advertised paths to all the egress PE edges.
> > > >
> > > > BGP add-paths is only used if you have Primary/Backup routing set
> > > > up, where PE1-GW1 has a 0x prepend and PE2-GW2 has a 1x prepend;
> > > > then, with BGP add-paths along with BGP PIC Edge, you have a
> > > > pre-programmed edge backup path.  So add-paths is not necessarily
> > > > something that helps with load balancing; it is in fact orthogonal
> > > > to it, as it is for Primary/Backup routing and not Active/Active
> > > > load balancing.  Load balancing with a VPN overlay is simply
> > > > achieved with a unique RD per PE, iBGP multipath, and equal-cost
> > > > paths to the IGP-learned underlay recursive next hop, in this case
> > > > the PE loopback0, per the “next-hop-self” next-hop rewrite done on
> > > > the PE-RR peering in a standard VPN overlay topology.
> > > >
> > > > The underlay load balancing I have described is independent of
> > > > SR-TE; however, with an SR-TE candidate path, the ECMP spray to
> > > > the egress PEs of the egress GW AS can also happen with the
> > > > Prefix-SID.
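A minimal Python sketch of the reviewer's point about unique RDs and iBGP multipath, under simplifying assumptions: the route reflector is modelled as keeping one best path per (RD, prefix) key, and multipath as taking every next hop that ties on the lowest IGP metric. The names `Path`, `reflect`, and `multipath_set`, the RD values, and the loopback labels are hypothetical and appear in neither the draft nor the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    rd: str          # route distinguisher (hypothetical values below)
    next_hop: str    # PE loopback0 after next-hop-self
    igp_metric: int  # IGP cost to next_hop


def reflect(paths, prefix="X"):
    """Model an RR that keeps a single best path per (RD, prefix) key."""
    best = {}
    for p in paths:
        key = (p.rd, prefix)
        # On a tie the first path seen is kept (a simplification).
        if key not in best or p.igp_metric < best[key].igp_metric:
            best[key] = p
    return list(best.values())


def multipath_set(paths):
    """Return the next hops that tie on the lowest IGP metric."""
    if not paths:
        return []
    lowest = min(p.igp_metric for p in paths)
    return sorted({p.next_hop for p in paths if p.igp_metric == lowest})


# Unique RD per PE: both paths survive reflection and form a multipath.
unique_rd = [Path("65002:1", "PE1-lo0", 10), Path("65002:2", "PE2-lo0", 10)]
# Shared RD: the RR reflects only one path, so there is nothing to balance.
shared_rd = [Path("65002:1", "PE1-lo0", 10), Path("65002:1", "PE2-lo0", 10)]

print(multipath_set(reflect(unique_rd)))
print(multipath_set(reflect(shared_rd)))
```

With equal metrics of 10, the unique-RD case yields both PE loopbacks while the shared-RD case yields only one, which is the behaviour the paragraph above describes.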
> > >
> > > OLD
> > >   Suppose that there are two gateways, GW1 and GW2 as shown in
> > >   Figure 1, for a given egress SR domain and that they each advertise a
> > >   route to prefix X which is located within the egress SR domain with
> > >   each setting itself as next hop.  One might think that the GWs for X
> > >   could be inferred from the routes' next hop fields, but typically it
> > >   is not the case that both routes get distributed across the backbone:
> > >   rather only the best route, as selected by BGP, is distributed.  This
> > >   precludes load balancing flows across both GWs.
> > >
> > > > I am rewriting the text in NEW because there is some discrepancy
> > > > between the routes said to be distributed across the backbone and
> > > > what actually gets distributed.  I am rewriting it completely to
> > > > make clearer what we are trying to state here, as the text appears
> > > > to be technically incorrect.  I will use the BGP route flow to
> > > > depict the routing and to get to the problem statement we are
> > > > trying to portray.
> > >
> > > NEW
> > >
> > > >   Suppose that there are two gateways, GW1 and GW2 as shown in
> > > >   Figure 1, for a given egress SR domain and each gateway advertises
> > > >   a VPN prefix X via EBGP to the AS2 core domain with the underlay
> > > >   next hop set to GW1 or GW2.  In this case we are Active/Active
> > > >   load balancing: PE1 and PE2 receive the VPN prefix and advertise
> > > >   VPN prefix X into the domain with next-hop-self set on the PE-RR
> > > >   peering to the PE’s loopback0.  The P routers within the domain
> > > >   have an ECMP path, with an IGP metric tie, to egress PE1 and
> > > >   egress PE2 for VPN prefix X learned from GW1 and GW2.  An SR-TE
> > > >   path can now be stitched from GW3 to PE3 SR-TE Segment-1 to PE3
> > > >   to PE6 and PE7 Segment-2 to PE8 and PE9 to Egress Domain via PE1
> > > >   and PE2 to GW1 and GW2.  In this case, however, we don’t want the
> > > >   traffic to be steered via SR-TE load balanced via ingress GW3; we
> > > >   want to take GW3 out of rotation and load balance traffic to GW4
> > > >   and GW5 instead.
> > >
> > > > **Text above provides the updated selective deterministic gateway
> > > > steering described below to achieve the goal.  I think that may
> > > > have been the intent of the authors and I am just making it more
> > > > clear**
> > >
> > > > As for the problem statement: since GW load balancing can easily
> > > > occur in the underlay, as stated, that is not the problem.
> > >
> > > > In my mind, the problem statement that we want to describe in both
> > > > the Abstract and the Introduction is not plain gateway load
> > > > balancing but rather a predictable, deterministic method of
> > > > selecting the gateways to be used.  That is, each VPN prefix now
> > > > has a descriptor attached: a tunnel encapsulation attribute that
> > > > contains multiple TLVs, one or more for each “selected gateway”,
> > > > with each tunnel TLV containing an egress tunnel endpoint sub-TLV
> > > > that identifies the gateway for the tunnel.
> > > >
> > > > Maybe the sub-TLV could carry a priority field giving the pecking
> > > > order for which GWs are pushed into the GW hash selected for the
> > > > SR-ERO path to be stitched end to end.  So let’s say you had 10
> > > > GWs and you break them up into two or more tiers, with gateways
> > > > 1-5 as primary and 6-10 as backup; that could be due to various
> > > > reasons.  You can then pick and choose, based on priority, which
> > > > GWs get added to the GW hash.
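The tiering idea can be illustrated with a short Python sketch. It is only an illustration of the reviewer's suggestion: the priority field does not exist in the draft, lower numbers are assumed to mean higher preference, and the function name `select_gateways` and the GW names are hypothetical.

```python
def select_gateways(priorities, active):
    """Pick the gateways that go into the load-balancing set.

    priorities: dict mapping gateway name to a priority tier
                (assumption: lower value means more preferred).
    active:     set of gateway names currently reachable.
    Returns the active gateways in the best tier that has any active member.
    """
    candidates = {gw: tier for gw, tier in priorities.items() if gw in active}
    if not candidates:
        return []
    best_tier = min(candidates.values())
    return sorted(gw for gw, tier in candidates.items() if tier == best_tier)


# Ten gateways: GW1-GW5 in the primary tier (0), GW6-GW10 in the backup tier (1).
priorities = {f"GW{i}": (0 if i <= 5 else 1) for i in range(1, 11)}

# All gateways up: only the primary tier is used.
print(select_gateways(priorities, set(priorities)))

# Every primary gateway down: the backup tier takes over.
print(select_gateways(priorities, {"GW6", "GW7"}))
```

Falling back to a whole tier at once is one possible design; a real mechanism could equally fill the set gateway by gateway in priority order.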
> > >
> > > > I have some feedback and comments on the solution and how best to
> > > > write the verbiage to make it more clear to the reader.
> > >
> > > > First, the RT to attach for the GW auto-discovery.  With this new
> > > > RT we are essentially creating a new VPN RIB that holds prefixes
> > > > from all the selected gateways discovered from the tunnel
> > > > encapsulation attribute TLV.
> > >
> > > > What is really confusing in the text here is whether the tunnel
> > > > encapsulation attribute is attached to the underlay recursive
> > > > route to the next hop or to the VPN overlay prefix.  I am thinking
> > > > it is attached to the VPN overlay prefix and not the underlay next
> > > > hop, because otherwise how would you create another transport RIB?
> > > > And if you are creating a new transport RIB, there is already a
> > > > draft by Kaliraj Vairavakkalai, or BGP-LU SAFI 4 labeled unicast,
> > > > which exists today to advertise next hops between domains for an
> > > > end-to-end load-balanced LSP.
> > >
> > >
> > > > https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07
> > >
> > > IANA code point below
> > > 76      Classful-Transport SAFI
> > > [draft-kaliraj-idr-bgp-classful-transport-planes-00]
> > >
> > > > Also in line with CT, another option is BGP-LU SAFI 4, to import
> > > > between domains the loopbacks that are the next hops to be
> > > > advertised into the core for an end-to-end LSP.  So the BGP-LU
> > > > SAFI RIB could be used for the GW next-hop advertisement between
> > > > domains, so that all the egress PE loopback0 addresses are visible
> > > > between domains.  You can then either stitch a segmented LSP, like
> > > > inter-AS option B with SR-TE stitching, using the next-hop-self
> > > > PE-RR next-hop rewrite on each of the PEs within the internet
> > > > domain; or you could import all the PE loopbacks from all ingress
> > > > and egress domains into the internet domain, similar to inter-AS
> > > > option C, to create an end-to-end LSP and instantiate an
> > > > end-to-end SR-TE path.
> > >
> > > > Maybe you could attach the RT and the tunnel encapsulation
> > > > attribute (tunnel TLV, endpoint sub-TLV) to the VPN overlay prefix.
> > > > I am not sure how that would be beneficial, as the underlay steers
> > > > the VPN overlay.
> > >
> > > > So maybe you could couple the new VPN overlay GW RIB RT to the
> > > > transport underlay CT class RIB or BGP-LU RIB.  That coupling may
> > > > have some benefit, but it would have to be investigated, and I
> > > > think it is out of scope for the goals of this draft.
> > >
> > > > I think we first have to figure out the authors’ goal and purpose
> > > > for this draft, and how the GW discovery should work in light of
> > > > the CT class RIB AFI/SAFI codepoint draft that exists today, as
> > > > well as the BGP-LU option for next-hop advertisement within the
> > > > internet domain.
> > >
> > > Section 3 comments
> > >
> > >      “Each GW is configured with an identifier for the SR domain.  That
> > >      identifier is common across all GWs to the domain (i.e., the same
> > >      identifier is used by all GWs to the same SR domain), and unique
> > >      across all SR domains that are connected (i.e., across all GWs to
> > >      all SR domains that are interconnected).
> > >
> > > **No issues with the above**
> > >
> > >      A route target ([RFC4360]) is attached to each GW's auto-discovery
> > >      route and has its value set to the SR domain identifier.
> > >
> > > > **So here, if the RT is attached to the GW auto-discovery route, we
> > > > need to state whether that is the underlay route, and that the PE
> > > > does a next-hop-self rewrite of the next hop on the eBGP link to
> > > > the egress domain to its loopback0, so that the GW next hop we are
> > > > tracking across all the ingress and egress PE domains is the
> > > > egress and ingress PE loopback0.**
> > >
> > >      Each GW constructs an import filtering rule to import any route
> > >      that carries a route target with the same SR domain identifier
> > >      that the GW itself uses.  This means that only these GWs will
> > >      import those routes, and that all GWs to the same SR domain will
> > >      import each other's routes and will learn (auto-discover) the
> > >      current set of active GWs for the SR domain.”
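The import rule quoted above from Section 3 can be sketched in Python as a simple filter. This is a minimal model and not the draft's encoding: the class `AutoDiscoveryRoute`, the function `discover_gateways`, and the domain identifier strings are hypothetical, and route targets are reduced to plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoDiscoveryRoute:
    gateway: str              # GW that originated the auto-discovery route
    route_targets: frozenset  # route targets carried by the route


def discover_gateways(local_domain_id, routes):
    """Import only routes whose route target equals the local SR domain
    identifier, and return the set of gateways learned from them."""
    return sorted(
        {r.gateway for r in routes if local_domain_id in r.route_targets}
    )


routes = [
    AutoDiscoveryRoute("GW1", frozenset({"sr-domain-A"})),
    AutoDiscoveryRoute("GW2", frozenset({"sr-domain-A"})),
    AutoDiscoveryRoute("GW3", frozenset({"sr-domain-B"})),
]

# A gateway configured with "sr-domain-A" learns GW1 and GW2 but not GW3.
print(discover_gateways("sr-domain-A", routes))
```

Because the identifier is shared by all GWs of one SR domain and unique across domains, the filter result is exactly the current GW set for that domain.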
> > >
> > > > **So if this is the case, and we are tracking the underlay RIB and
> > > > attaching a route target to all the ingress PE & P next hops,
> > > > which are loopback0, then this is literally identical to BGP-LU
> > > > importing all the loopbacks between domains, or to using CT class.
> > > > There is no need for this feature to use the tunnel encapsulation
> > > > attribute.  I am not following why you would not use BGP-LU or the
> > > > CT class RIB.**
> > >
> > >   “To avoid the side effect of applying the Tunnel Encapsulation
> > >   attribute to any packet that is addressed to the GW itself, the GW
> > >   SHOULD use a different loopback address for packets intended for it.”
> > >
> > > > **I don’t understand this statement, as the next hop being tracked
> > > > for the gateway load balancing is the ingress and egress PE
> > > > loopback0.  The subnet between the GW and the PE is not advertised
> > > > into the internet domain, because we do next-hop-self on the PE-RR
> > > > iBGP peering.**  Looking at it a second time, I think the idea
> > > > here is a BGP-LU inter-AS option C style import of loopbacks
> > > > between domains: instead of importing loopback0, which carries all
> > > > packets on the GW device, use a different loopback on GW1 so that
> > > > it does not carry the FEC of all BAU packets, similar to the
> > > > "per-vrf TE" concept used in RSVP-TE to VPN mapping