I am the assigned Gen-ART reviewer for this draft. The General Area Review Team (Gen-ART) reviews all IETF documents being processed by the IESG for the IETF Chair. Please treat these comments just like any other last call comments. For more information, please see the FAQ at .

Document: draft-ietf-bess-datacenter-gateway-??
Reviewer: Gyan Mishra
Review Date: 2021-04-28
IETF LC End Date: 2021-04-29
IESG Telechat date: Not scheduled for a telechat

Summary:

This document defines a mechanism using the BGP Tunnel Encapsulation attribute that allows each gateway router to advertise routes to the prefixes in the Segment Routing domains to which it provides access, and also to advertise on behalf of every other gateway to the same Segment Routing domain.

This draft needs to provide more clarity on the use case: where this mechanism would apply, and how it would be used and implemented. From reading the specification it appears there are some technical gaps. There are some major issues with this draft, and I don't think it is ready yet.

Major issues:

Abstract comments:

The Abstract mentions the use of Segment Routing within the Data Center. Is SR within the Data Center a requirement for this specification to work? It is presented that way throughout the draft. Technically, I would think the concept of gateway discovery is feasible without requiring SR within the Data Center. Load balancing is the bigger issue raised by this draft as its problem statement and what it is trying to solve, which I address in the Introduction comments.

Introduction comments:

In the Introduction, the use case is expanded much further, to any functional edge AS, per the verbiage below.

OLD

"SR may also be operated in other domains, such as access networks. Those domains also need to be connected across backbone networks through gateways. For illustrative purposes, consider the Ingress and Egress SR Domains shown in Figure 1 as separate ASes. The various ASes that provide connectivity between the Ingress and Egress Domains could each be constructed differently and use different technologies such as IP, MPLS with global table routing native BGP to the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN"

This paragraph expands the use case to any ingress or egress stub domain: Data Center, Access, or any other. If that is the case, should the draft be renamed to something like "stub edge domain services discovery"? Since this draft can be used for any such domain, I would not preclude any use case: I would make the GW discovery open to any service GW edge function and change the draft name to something more appropriate.

This paragraph also says "for illustrative purposes", which is fine, but it then expands the overlay/underlay use cases. I believe this mechanism can only be used with a technology that has an overlay/underlay split, which would preclude any use case with just an underlay (global table routing), such as the "IP, MPLS with global table routing native BGP to the edge" cases mentioned. IP or global table routing would be an issue because this specification requires setting a Route Target (RT) and an export/import RT policy for the discovery of routes advertised by the GWs. As I don't think this solution would work technically for global table routing, I have updated the paragraph below to preclude global table routing.
We can add global table routing back in if we can figure that out, but I don't think any public or private operator that today carries all BGP prefixes in the global table underlay would make the drastic change to a VPN overlay, pushing all the any-to-any prefixes into the overlay, as that would be a prerequisite to using this draft. From this point forward I am going to assume we are using a VPN overlay technology with SR or MPLS.

NEW

"SR may also be operated in other domains, such as access networks. Those domains also need to be connected across backbone networks through gateways. For illustrative purposes, consider the Ingress and Egress SR Domains shown in Figure 1 as separate ASes. The various ASes that provide connectivity between the Ingress and Egress Domains could be two, as shown in Figure 1, or could be many more, as with the public Internet use case, and each may be constructed differently and use different technologies such as MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN with a 'BGP free' core."

This may work without a "BGP free" core, but to reduce design complexity I would constrain the transport layer to a "BGP free" core; SR-TE path steering also gets much more complicated if all P routers are running BGP. In this example we could even explicitly say it shows the public Internet, as that would be one of the primary use cases.

This paragraph is confusing to the reader. As a precursor to it, I think it would be a good idea to state whether we are talking about global table IP-only routing or a VPN overlay technology with SR/MPLS underlay transport. That would make this section much easier to understand.

In the Figure 1 drawing, please give an AS number to both the ingress domain and the egress domain, so the reader does not have to guess whether the ingress or egress domain is connected via iBGP or eBGP, and state eBGP in the text below the figure. Let's also note that the intermediate ASes in the middle, as depicted in the diagram, could be two as shown illustratively, or could be many operator domains, as in the case of traversing the public Internet.

In the drawing I would replace ASBR with PE, since per this solution, as I am stating it, we have a VPN overlay paradigm and not global routing. Also, in the VPN overlay scenario, any inter-AS peering is almost always between PEs and not a separate dedicated device serving a special "ASBR-ASBR" function; the PE acts as the border node providing the ASBR-type function. So in the rewrite below I am assuming the drawing has been updated to change ASBR to PE.

Let's give each node a number so the text can be clear about exactly which node it refers to. In the drawing, please show that GW1 peers with PE1, GW2 peers with PE2, and GW3 peers with PE3. GW3 also peers with GW4, and GW2 peers with GW5, where GW4 and GW5 are part of AS3. In the AS1-AS2 peering, the top peering is PE6 to PE8 and the bottom peering is PE7 to PE9, so PE6 and PE7 are in AS1 and PE8 and PE9 are in AS2. I made the bottom two ASBRs in AS3, now called GW4 and GW5, the ones used later in the problem statement for selective deterministic load balancing.

One major problem with the problem statement description is that it is incorrect about GW load balancing: it claims load balancing does not work today in the topology given in Figure 1.
Edge GW load balancing is based on the iBGP path tie-breaker of lowest IGP underlay metric, the lowest common denominator in the BGP path selection. As long as the metrics are equal and iBGP multipath is enabled, you can load balance to the egress PE1 and PE2 endpoints. So in this case, flows coming from AS1 into AS2 hit an intermediate P router that has iBGP multipath enabled and, say, equal cost routes to the next hop attribute (assuming next-hop-self is set), e.g., a cost of 10 to loopback0 on both PE1 and PE2, and you now have a BGP multipath. (See the sketch after the NEW text below.)

What is required, though, is that the RD be unique. In a "BGP free" core RR environment, where all PEs peer with the RR as route-reflector clients, the RD must be unique per PE for the RR to reflect all paths to all the egress PE edges.

BGP add-paths is only needed if you have a Primary/Backup routing setup, where, say, the PE1-GW1 path has no prepend (0x) and the PE2-GW2 path has a 1x prepend; with BGP add-paths along with BGP PIC Edge you then have a pre-programmed edge backup path. So add-paths is not something that helps load balancing; it is in fact orthogonal to it, as it serves Primary/Backup routing and not Active/Active load-balanced routing. Load balancing with a VPN overlay is simply achieved with a unique RD per PE, iBGP multipath, and equal cost paths in the underlay to the recursively resolved IGP-learned next hop attribute, in this case the PE loopback0 per the next hop rewrite via next-hop-self done on the PE-RR peering in a standard VPN overlay topology.

As far as load balancing in the underlay, what I have stated is independent of SR-TE; however, with an SR-TE candidate path, the ECMP spray to the egress PE / egress GW AS can also happen via the prefix-SID.

OLD

"Suppose that there are two gateways, GW1 and GW2 as shown in Figure 1, for a given egress SR domain and that they each advertise a route to prefix X which is located within the egress SR domain with each setting itself as next hop. One might think that the GWs for X could be inferred from the routes' next hop fields, but typically it is not the case that both routes get distributed across the backbone: rather only the best route, as selected by BGP, is distributed. This precludes load balancing flows across both GWs."

I am rewriting this text in the NEW below, as there is a discrepancy about which routes are distributed across the backbone and what actually gets distributed. The text appears to be technically incorrect, so I am completely rewriting it to make clearer what we are trying to state, using the BGP route flow to depict the routing and arrive at the problem statement we are trying to portray.

NEW

"Suppose that there are two gateways, GW1 and GW2 as shown in Figure 1, for a given egress SR domain, and each gateway advertises a VPN prefix X via eBGP to the AS2 core domain with the underlay next hop set to GW1 or GW2. In this case we have Active/Active load balancing: PE1 and PE2 receive the VPN prefix and advertise it into the domain with next-hop-self set on the PE-RR peering, rewriting the next hop to each PE's loopback0. The P routers within the domain have ECMP paths, with an IGP metric tie, to egress PE1 and egress PE2 for VPN prefix X learned from GW1 and GW2. An SR-TE path can now be stitched from GW3: SR-TE Segment-1 from PE3 to PE6 and PE7, and Segment-2 from PE8 and PE9 to the egress domain via PE1 and PE2, reaching GW1 and GW2. In this case, however, suppose we do not want the traffic to be steered and SR-TE load balanced via ingress GW3; we want to take GW3 out of rotation and load balance traffic to GW4 and GW5 instead."

**The text above provides the updated selective deterministic gateway steering described below to achieve the goal. I think that may have been the intent of the authors and I am just making it clearer.**
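To make the underlay load-balancing behavior I described above concrete, here is a minimal Python sketch of the unique-RD-plus-iBGP-multipath logic. It is illustrative only and not from the draft; all names and metric values are mine:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class VpnPath:
        rd: str          # route distinguisher (unique per PE)
        prefix: str      # VPN prefix
        next_hop: str    # PE loopback0 after next-hop-self rewrite
        igp_metric: int  # underlay IGP cost to the next hop

    # With a unique RD per PE, the RR sees two distinct VPN routes for X
    # and reflects both; with a shared RD it would reflect only one best path.
    paths = [
        VpnPath(rd="65002:1", prefix="X", next_hop="PE1-lo0", igp_metric=10),
        VpnPath(rd="65002:2", prefix="X", next_hop="PE2-lo0", igp_metric=10),
    ]

    def multipath(paths):
        """iBGP multipath: keep all paths tied on the final IGP-metric tie-breaker."""
        best = min(p.igp_metric for p in paths)
        return [p for p in paths if p.igp_metric == best]

    # Both PE1 and PE2 survive, so traffic ECMPs across GW1 and GW2 via the underlay.
    print([p.next_hop for p in multipath(paths)])   # ['PE1-lo0', 'PE2-lo0']

If PE1 and PE2 shared the same RD, the RR would select and reflect a single best path, and only one next hop would survive, which is the situation the OLD text describes; with unique RDs the load balancing already works today.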
As for the problem statement: since GW load balancing can easily occur in the underlay, as stated above, that is not the problem. To my mind, the problem statement we want to describe in both the Abstract and the Introduction is not vanilla gateway load balancing, but rather a predictable, deterministic method of selecting which gateways are used. That is, each VPN prefix now has a descriptor attached, the Tunnel Encapsulation attribute, which contains multiple tunnel TLVs, one or more for each "selected gateway", and each tunnel TLV contains a Tunnel Egress Endpoint sub-TLV that identifies the gateway for the tunnel. Maybe we could also have a priority field in the sub-TLV, a pecking-order preference for which GWs are pushed up into the GW hash selected for the SR-ERO path to be stitched end to end. So let's say you had 10 GWs: you could break them up into two or more tiers, with, say, gateways 1-5 primary and 6-10 backup, for various reasons, so you can pick and choose by priority which GWs get added to the GW hash. (A sketch of this attribute structure and tiered selection follows at the end of this comment.)

I have some feedback and comments on the solution and how best to write the verbiage to make it clearer to the reader. Consider the solution as far as the RT to attach for the GW auto-discovery: with this new RT we are essentially creating a new VPN RIB that holds the prefixes from all the selected gateways discovered from the Tunnel Encapsulation attribute TLVs.

What is really confusing in the text here is whether the Tunnel Encapsulation attribute is attached to the underlay recursive route to the next hop attribute, or to the VPN overlay prefix. The reason I think it is attached to the VPN overlay prefix and not the underlay next hop attribute is: how would you otherwise create another transport RIB? And if you are creating a new transport RIB, there is already a draft by Kaliraj Vairavakkalai defining one, as well as BGP-LU (SAFI 4, labeled unicast), which exists today to advertise next hops between domains for an end-to-end load-balanced LSP path.

https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07

IANA code point:

   76  Classful-Transport SAFI  [draft-kaliraj-idr-bgp-classful-transport-planes-00]

Also, in line with CT, another option is BGP-LU (SAFI 4) to import the loopbacks between domains, these being the next hop attributes to be advertised into the core for the end-to-end LSP. The BGP-LU SAFI RIB could be used for the GW next hop advertisement between domains so that all the egress PE loopback0 addresses are visible between domains. So you can either stitch a segmented LSP, inter-AS option B style with SR-TE stitching, using a next-hop-self PE-RR next hop rewrite on each of the PEs within the Internet domain; or you can import all the PE loopbacks from all the ingress and egress domains into the Internet domain, similar to inter-AS option C, to create an end-to-end LSP and instantiate an end-to-end SR-TE path.
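Here is a minimal Python sketch of the deterministic, tiered gateway selection I am suggesting above. The per-GW tunnel TLV with a Tunnel Egress Endpoint sub-TLV follows the draft's model; the priority field is my proposed addition, and all names are illustrative:

    import hashlib
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TunnelTLV:
        egress_endpoint: str  # Tunnel Egress Endpoint sub-TLV: the GW address
        priority: int         # hypothetical priority sub-TLV (1 = primary tier)

    # One tunnel TLV per selected gateway, carried with VPN prefix X.
    tunnel_encap_attr = [
        TunnelTLV("GW4", priority=1),
        TunnelTLV("GW5", priority=1),
        TunnelTLV("GW3", priority=2),   # demoted: taken out of rotation
    ]

    def gw_hash_set(attr):
        """Only the best (lowest-value) priority tier enters the GW hash."""
        best = min(t.priority for t in attr)
        return [t for t in attr if t.priority == best]

    def pick_gw(attr, flow_id: str):
        """Hash a flow onto one gateway from the selected tier."""
        tier = gw_hash_set(attr)
        digest = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
        return tier[digest % len(tier)].egress_endpoint

    print([t.egress_endpoint for t in gw_hash_set(tunnel_encap_attr)])  # ['GW4', 'GW5']
    print(pick_gw(tunnel_encap_attr, "flow-42"))  # GW4 or GW5, stable per flow

This shows the intent: GW3 is still advertised but sits in the backup tier, so traffic is deterministically load balanced across GW4 and GW5 only, which is the selective steering goal described above.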
Maybe you could attach the Tunnel Encapsulation attribute tunnel TLV / egress endpoint sub-TLV, with its RT, to the VPN overlay prefix, but I am not sure how that would be beneficial, since the underlay steers the VPN overlay. Or maybe you could couple the new VPN overlay GW RIB RT to the transport underlay CT class RIB or BGP-LU RIB; that coupling may have some benefit, but it would have to be investigated, and I think it is out of scope for the goals of this draft. I think we first have to figure out the authors' goal and purpose for this draft, and how the GW discovery should work, in light of the CT class RIB AFI/SAFI code point draft that exists today as well as the BGP-LU option for next hop advertisement within the Internet domain.

Section 3 comments:

"Each GW is configured with an identifier for the SR domain. That identifier is common across all GWs to the domain (i.e., the same identifier is used by all GWs to the same SR domain), and unique across all SR domains that are connected (i.e., across all GWs to all SR domains that are interconnected)."

**No issues with the above.**

"A route target ([RFC4360]) is attached to each GW's auto-discovery route and has its value set to the SR domain identifier."

**If the RT is attached to the GW auto-discovery route, we need to state that this is the underlay route, and that the PE rewrites the next hop from the eBGP link to the egress BGP domain to its own loopback0 via next-hop-self, so the GW next hop being tracked across all the ingress and egress PE domains is the ingress and egress PE loopback0.**

"Each GW constructs an import filtering rule to import any route that carries a route target with the same SR domain identifier that the GW itself uses. This means that only these GWs will import those routes, and that all GWs to the same SR domain will import each other's routes and will learn (auto-discover) the current set of active GWs for the SR domain."

**If this is the case, and we are tracking the underlay RIB and attaching a route target to all the ingress PE and P next hops (i.e., loopback0), this is literally identical to BGP-LU importing all the loopbacks between domains, or to using the CT class. There is no need for this feature to use the Tunnel Encapsulation attribute; I am not following why you would not use BGP-LU or the CT class RIB.**

"To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW itself, the GW SHOULD use a different loopback address for packets intended for it."

**I don't understand this statement, as the next hop is the ingress and egress PE loopback0; that is the next hop being tracked for the gateway load balancing. The subnet between the GW and the PE is not advertised into the Internet domain, since we do next-hop-self on the PE-RR iBGP peering, so the GW-to-PE subnet is not advertised. Looking at it a second time, I think the idea here is a BGP-LU inter-AS option C style import of loopbacks between domains: instead of importing loopback0, which carries all packets on the GW device, use a different loopback on the GW so it does not carry the FEC of all business-as-usual packets. A similar concept is utilized in RSVP-TE-to-VPN mapping, the "per-vrf TE" concept.**
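To restate the auto-discovery mechanism quoted above, here is a minimal Python sketch of the RT-based import rule and the resulting active GW set. It is illustrative only; the names and identifier values are mine, not from the draft:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AutoDiscoveryRoute:
        gw: str               # advertising gateway
        route_targets: tuple  # RTs carried by the route

    SR_DOMAIN_ID = "64512:100"  # illustrative SR domain identifier used as the RT value

    received = [
        AutoDiscoveryRoute("GW1", ("64512:100",)),
        AutoDiscoveryRoute("GW2", ("64512:100",)),
        AutoDiscoveryRoute("GW9", ("64512:200",)),  # GW for a different SR domain
    ]

    def import_filter(routes, domain_id):
        """Import only routes carrying the RT equal to this GW's SR domain ID."""
        return [r for r in routes if domain_id in r.route_targets]

    # The importing GW auto-discovers its peers; a withdrawal of GW1's route
    # would shrink this set and trigger re-advertisement of external routes
    # with an updated Tunnel Encapsulation attribute.
    active_gws = {r.gw for r in import_filter(received, SR_DOMAIN_ID)}
    print(sorted(active_gws))  # ['GW1', 'GW2']

My point stands either way: the same "track the active set, react to withdrawals" behavior falls out of a BGP-LU or CT RIB carrying the loopbacks, without a new mechanism.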
"As described in Section 1, each GW will include a Tunnel Encapsulation attribute with the GW encapsulation information for each of the SR domain's active GWs (including itself) in every route advertised externally to that SR domain. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW) each externally advertised route will be re-advertised with a new Tunnel Encapsulation attribute which reflects the current set of active GWs."

**What is the route being advertised externally from the GW? The routes advertised would be all the PE loopbacks, advertised from both the ingress and egress domains into the Internet domain, and all the loopbacks from the Internet domain into the ingress and egress domains, which could be done via BGP-LU or the CT RIB; there is no need to reinvent the wheel and create a new RIB. BGP-LU or the CT RIB tracks the current set of active GW next hop loopbacks between domains.**

**If you do SR-TE stitching, you can do next-hop-self on each PE-RR peering for the load balancing, and the load balancing would be to the PE loopbacks. Or, for an end-to-end SR-TE path using BGP-LU or the CT RIB, importing all the PE loopbacks between domains, the current set of active GWs would be tracked via the BGP-LU or CT RIB, so if the active GWs change due to GW failures, they would be withdrawn from the BGP-LU or CT underlay RIB. There is then no need for the Tunnel Encapsulation attribute, at least for the GW auto-discovery load balancing.**

I think it may still be possible to retrofit this draft to utilize the CT RIB or BGP-LU for the GW load balancing, so nothing new has to be designed as far as the underlay goes. However, the idea of providing some visibility from the VPN overlay route into the underlay may have merit; there may be some benefit to the Tunnel Encapsulation attribute RT import policy attached to the VPN overlay prefixes. Since the CT draft provides a complete solution, underpinning the VPN overlay, per VPN or per prefix, with the underlay CT RIB, the problem statement is completely solved with either the CT draft or BGP-LU.

Minor issues:

None

Nits/editorial comments:

Please add normative and informative references as below. I would reference the CT class draft, which creates a new transport class, as normative, or at least informative; I think this draft can work well in conjunction with the CT class, coupling the GW RIB created here to the CT class transport RIB and providing the end-to-end inter-AS stitching via a PCECC controller. I am one of the co-authors of that draft, and I think it could be coupled with this GW draft to provide the overall goal of selective GW load balancing.

https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07

I would also reference this draft for the CT class PCEP coloring extension:

https://tools.ietf.org/html/draft-rajagopalan-pcep-rsvp-color-00

As this solution would utilize a centralized controller (PCECC) for inter-AS path instantiation for the GW load balancing, I think it would be a good idea to reference PCECC, H-PCE, inter-AS PCE, and the PCE SR extension as informative, and maybe even normative, references.