It is easy to route around broken links: in a well-designed DC we can
reroute (really re-forward) within 50 ms after the failure has been
detected, without FRR (aka fast-rehash). Gray failures are a very
different kind of animal. You might ask how often optics degrade or
fibers get dirty, but at large scale this is a constant problem, and it
takes time (minutes, if we are lucky) to detect and safely isolate.

While it might not be to someone's taste, it is a working solution for
the aforementioned problem in large DC fabrics.

Cheers,
Jeff

> On Aug 6, 2021, at 18:48, Toerless Eckert <tte@xxxxxxxxx> wrote:
>
> [bitching]
> I apologize for attempting to respond to the original post topic instead
> of derailing the thread into my pet side topic without changing subject,
> which seems to be expected behavior on ietf@xxxxxxxx.
> [/bitching]
>
> Adding ipv6@xxxxxxxx as that seems to be the closest WG list for the topic.
>
> Brian reminded us that we have ample RFCs to elaborate on the fact that
> you cannot reasonably expect connections to an anycast address to work
> when persistently using the anycast address.
>
> Christian pointed out how QUIC does the right thing. Great! Maybe we should
> have an anycast support hall of fame and shame for protocols: does it or
> does it not support single round-trip resolution of anycast to unicast
> address?
>
> But back to what seems to be the root cause, which isn't anycast, but IPv6
> flow label "abuse"?!
>
> I specifically had not heard of this Linux "hack" to change the flow label
> mid-connection after TCP RTO to overcome a seemingly broken path and hope
> for the new flow label to pick another, working path (most likely in a
> data center).
>
> I do not think that this endpoint behavior is endorsed by RFC6437, and
> given the absence of a description in RFC6437 of the circumstances under
> which an endpoint could or should change the Flow Label for an in-progress
> transport connection, I would conclude that this Linux behavior is not in
> compliance with how the normative RFC6437 describes proper host assignment
> of the flow label.
>
> Could someone who sleeps with RFC6437 under his pillow comment on whether
> or not my assessment is accurate? I seem to remember Bob mentioned he just
> carefully read through all those flow label RFCs...
>
> Of course, pointing out to Linux that what it does is a hack would not make
> TCP to an anycast address any less of a hack.
>
> Ultimately, I do not have a lot of sympathy for the Linux behavior, even
> if it were blessed by RFC6437, because I think good networks should fix
> broken paths fast enough for this hack to be unnecessary...
>
> Cheers
> Toerless
>
>> On Tue, Aug 03, 2021 at 10:45:29PM +1200, Brian Carpenter wrote:
>> The issue of anycast and unstable routes is hardly a new discovery; this
>> Linux feature is not creating a new problem. I suggest reading RFC7094 and
>> RFC4786 before continuing this conversation.
>>
>> I certainly wouldn't design a protocol that relied on stable transport
>> connections to an anycast address.
>>
>> Regards,
>> Brian Carpenter
>> (via tiny screen & keyboard)
>>
>> On Tue, 3 Aug 2021, 22:10 Michael Tuexen, <michael.tuexen@xxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>>> On 3. Aug 2021, at 11:44, Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>
>>>> wrote:
>>>>
>>>> Hi all,
>>>> I am writing to this alias because I do not know the proper one for this
>>>> type of problem (OS/LINUX/BSD).
>>>> The history of how Alexander Azimov (Yandex) found the problem is
>>>> below.
>>>>
>>>> In short: if TCP loses connectivity for 200ms (or 1s in SYN stage) then
>>>> TCP changes the IPv6 flow label (for the active TCP session!) to push
>>>> traffic to a different path.
>>>> Current networks are extensively ECMP; if intermediate nodes support
>>>> flow label for hash calculation then there is a high probability that
>>>> the path would be changed.
>>>> LINUX/BSD does not want to wait till the network will fix its problem.
>>> As far as I know, Linux implements something like what you describe, but
>>> I'm not aware of this behaviour being implemented in *BSD, at least not
>>> in FreeBSD.
>>>>
>>>> If the final destination was anycast then the final destination would be
>>>> changed too by the same hash calculation.
>>>> The stateful session would be broken as a result (see the second part of
>>>> Alexander’s presentation below).
>>>>
>>>> Since the time LINUX made RTO flow label recalculation the default
>>>> (2016), IPv6 Anycast is broken.
>>>> People would have one more reason not to migrate to IPv6. Flow label
>>>> does not exist in IPv4 – the OS is not capable of breaking IPv4 Anycast
>>>> similarly.
>>>>
>>>> Would anybody like to spend his/her karma to save IPv6 Anycast OR let
>>>> it die?
>>>> It has been broken already for 5 years and nobody has spotted it up to
>>>> now. Is it needed?
>>>> (I have seen a few drafts heavily dependent on IPv6 anycast)
>>>>
>>>> What is the proper WG for such a problem?
>>> At IETF 110 Alexander gave a presentation on this in TCPM and V6OPS. See
>>> the Minutes and the corresponding slides at
>>> https://datatracker.ietf.org/meeting/110/proceedings
>>>
>>> At least at the TCPM meeting, it was suggested that an ID would be written.
>>>
>>> However, the behaviour you are describing is implementation specific to
>>> Linux; it is not described or recommended by an RFC.
>>>
>>> Best regards
>>> Michael
>>>>
>>>> I am concerned that Anycast has been killed, it is not an easily
>>>> replaceable tool.
>>>> Maybe somebody would propose something better but if not
>>>> then LINUX should be returned to 2015 when flow label change on RTO was
>>>> a non-default configuration.
>>>> Such LINUX behavior could be valuable in some restricted domains (see
>>>> below) when the administrator is sure that Anycast is not possible on
>>>> the traffic path.
>>>>
>>>> Eduard
>>>> From: Vasilenko Eduard
>>>> Sent: Tuesday, August 3, 2021 12:05 PM
>>>> To: 'Jeff Tantsura' <jefftant.ietf@xxxxxxxxx>; Alexander Azimov <a.e.azimov@xxxxxxxxx>
>>>> Cc: Alexander Azimov <mitradir@xxxxxxxxxxxxxx>; routing WG <rtgwg@xxxxxxxx>
>>>> Subject: RE: Self-healing Networking with Flow Label
>>>>
>>>> Hi all,
>>>> Not many people worldwide read this alias and understand
>>>> that RTO could be leveraged to fight “silent drops” in the DC
>>>> environment.
>>>> It is a good use case to publish/document (with more details than there
>>>> were in the presentation).
>>>> I hope that in the future OAM would be used for this purpose – it is
>>>> better from an architecture point of view.
>>>> Eduard
>>>> From: Jeff Tantsura [mailto:jefftant.ietf@xxxxxxxxx]
>>>> Sent: Tuesday, August 3, 2021 1:08 AM
>>>> To: Alexander Azimov <a.e.azimov@xxxxxxxxx>
>>>> Cc: Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>; Alexander Azimov <mitradir@xxxxxxxxxxxxxx>; routing WG <rtgwg@xxxxxxxx>
>>>> Subject: Re: Self-healing Networking with Flow Label
>>>>
>>>> Eduard,
>>>>
>>>> The idea of the draft to come is to explain what to do - when and how.
>>>> The goal is not to regulate (we really don’t) but to provide, similarly
>>>> to RFC7938, a set of guidelines that the community can use to build
>>>> better and more resilient networks.
>>>>
>>>> Cheers,
>>>> Jeff
>>>>
>>>>
>>>> On Aug 2, 2021, at 04:01, Alexander Azimov <a.e.azimov@xxxxxxxxx> wrote:
>>>>
>>>>
>>>> Eduard,
>>>>
>>>> On Mon, 2 Aug 2021 at 13:45, Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx> wrote:
>>>> It is the key in this presentation “This behavior MUST be switched off
>>>> by default”
>>>> It has been shown on slides 7-10 that flow label change on RTO is
>>>> enabled by default for BSD and LINUX.
>>>> It needs regulation.
>>>> It needs a standard RFC. Because it kills Anycast otherwise.
>>>> As I'm partially responsible for the key points of the presentation, I
>>>> can stress that it is a bit different.
>>>> • We have an opportunity for self-healing TCP on top of IPv6, it
>>>>   should be preserved;
>>>> • The Linux defaults should be changed to a safe mode to prevent
>>>>   session timeouts;
>>>> • The hash recalculation behavior should be documented;
>>>> I'm not sure what you mean by the term 'regulation'.
>>>>
>>>> The story of how to use RTO to work around “silent drop” vendor bugs
>>>> could be a good informational RFC.
>>>> Maybe people developing iOAM would pay more attention to this use case.
>>>>
>>>> IMHO: these are 2 separate drafts.
>>>> I'm not sure about it, we'll try to provide -00 before the next IETF
>>>> meeting, let's see how it progresses.
>>>>
>>>> Eduard
>>>> From: Alexander Azimov [mailto:mitradir@xxxxxxxxxxxxxx]
>>>> Sent: Monday, August 2, 2021 1:20 PM
>>>> To: Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>; Jeff Tantsura <jefftant.ietf@xxxxxxxxx>
>>>> Cc: routing WG <rtgwg@xxxxxxxx>
>>>> Subject: Re: Self-healing Networking with Flow Label
>>>>
>>>> Eduard,
>>>>
>>>> Please see the quote from slide 28. My suggestion was:
>>>>
>>>> Client – sends SYN, Server – responds with SYN&ACK
>>>> • In case of SYN_RTO or RTO events Server SHOULD recalculate its
>>>>   TCP socket hash, thus change Flow Label. This behavior MAY be
>>>>   switched on by default;
>>>> • In case of SYN_RTO or RTO events Client MAY recalculate its TCP
>>>>   socket hash, thus change Flow Label. This behavior MUST be switched
>>>>   off by default;
>>>> This looks like a safe default behavior that keeps part of the
>>>> improvements, but makes working with stateful anycast services safe.
>>>>
>>>> And yes, IMO it's ok to have a knob to enable it in the controlled
>>>> environment.
>>>> If you ask how to enable it in the presence of internal
>>>> anycast services - there was also a suggestion in the slides: eBPF. It
>>>> gives a good way to make this kind of separation.
>>>>
>>>> 02.08.2021, 11:48, "Vasilenko Eduard" <vasilenko.eduard@xxxxxxxxxx>:
>>>> Hi Jeff,
>>>> The situation when the Control Plane does not understand what the
>>>> Forwarding plane is doing is a bug.
>>>> Yes, RTO in TCP helps to find a work-around for this bug. And yes,
>>>> Anycast is typically absent inside DC – it does not create the problem
>>>> in the DC environment.
>>>>
>>>> But the same LINUX is used outside DC. RTO Flow Label change here would
>>>> create even more problems if Anycast would happen on the traffic path
>>>> (not very predictable for the client).
>>>> Do we need a separate LINUX distribution for DC and a separate
>>>> distribution for other environments?
>>>> Or should we rely on the proper non-default configuration for different
>>>> environments? (Admin should not forget to change)
>>>> What if Anycast would become needed in DC?
>>>>
>>>> RTO flow label recalculation is mutually exclusive with Anycast on the
>>>> traffic path.
>>>> What is more valuable for the public?
>>>>
>>>> IMHO: It is better to fight this type of bug with iOAM
>>>> than to cancel Anycast.
>>>>
>>>> IMHO: It is better to have Flow Label recalculation on RTO as “off” by
>>>> default.
>>>> A DC Admin is better qualified to activate it in a controlled
>>>> environment than every client worldwide, who would have to remember to
>>>> disable it.
>>>>
>>>> Eduard
>>>> From: Jeff Tantsura [mailto:jefftant.ietf@xxxxxxxxx]
>>>> Sent: Monday, August 2, 2021 6:56 AM
>>>> To: Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>
>>>> Cc: mitradir@xxxxxxxxxxxxxx; routing WG <rtgwg@xxxxxxxx>
>>>> Subject: Re: Self-healing Networking with Flow Label
>>>>
>>>> Eduard,
>>>>
>>>> The issue is not present in the link/device failure case: if well
>>>> implemented, fast rehash takes care of updating forwarding within a
>>>> number of ms. The problem is with “gray” failures, when the link in
>>>> question is UP from a routing/forwarding perspective but drops traffic
>>>> (mostly occasionally, and when a HW bug occurs the dropped traffic has
>>>> distinct flow attributes).
>>>>
>>>> In many large DC fabrics, the majority of the traffic is east-west,
>>>> between end-points that aren’t anycast. In such deployments the
>>>> solution solves issues rather elegantly and without any intervention
>>>> from the operator.
>>>> The issues/side effects are well understood and will be documented.
>>>>
>>>> The best way to receive RTGWG emails is well… to subscribe to RTGWG ;-)
>>>>
>>>> Cheers,
>>>> Jeff
>>>>
>>>>
>>>> On Aug 1, 2021, at 09:47, Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>
>>>> wrote:
>>>>
>>>>
>>>> Hi Alexander,
>>>>
>>>> Have I understood your presentation right?
>>>> The client SHOULD change the IPv6 flow label after SYN RTO to have a
>>>> chance to be moved to the working path inside DC fabric (if DC fabric
>>>> supports flow label for hash calculation)
>>>> But at the same time
>>>> The client SHOULD NOT change the IPv6 flow label after SYN RTO to avoid
>>>> being switched to a different TCP proxy engine.
>>>>
>>>> Looks like a deadlock, especially if both things should happen for the
>>>> same traffic:
>>>> it should reach DC fabric
>>>> and
>>>> it should be hash load-balanced between different TCP proxy engines (or
>>>> applications) inside DC Fabric.
>>>>
>>>> I see one bad solution (“Disable Flow Label”):
>>>> Routers up to the TCP proxy engine SHOULD be configured not to use flow
>>>> label (by the way these are all routers on the Internet),
>>>> TCP flow engines SHOULD be outside of the DC Fabric (CLOS) – probably in
>>>> front of it.
>>>> Routers/Switches inside DC Fabric SHOULD use flow labels.
>>>>
>>>> I see another bad solution (“Disable Anycast”):
>>>> Disable anycast on routers in principle, use only stateful LB.
>>>>
>>>>
>>>> It has been commented in the chat that Anycast is not possible in
>>>> principle for stateful connections. It is too general a statement.
>>>> Anycast is just not compatible with Flow Label. It is not a problem for
>>>> IPv4 anycast even if the connection is stateful (TCP) because the
>>>> 5-tuple used for the hash would not change.
>>>> Hence, IPv6 anycast became dead at the time when Flow Label change
>>>> was added in LINUX for active TCP sessions.
>>>>
>>>> Among 3 things:
>>>> - Anycast
>>>> - Flow Label load balancing (basic Flow Label functionality)
>>>> - Flow Label change on the active session for the application to be
>>>>   more active in new path search
>>>> You have to choose which one to kill – all 3 are not compatible with
>>>> each other at the same time.
>>>> I vote to disable Flow Label change in LINUX. Then wait till the network
>>>> would fix itself.
>>>> We have so many fancy TE tools these days. A broken link or a broken
>>>> node could be excluded from routing within 50ms.
>>>>
>>>> PS: I am not subscribed to the RTGWG alias, please keep me on a copy of
>>>> this thread.
>>>> Best Regards
>>>> Eduard Vasilenko
>>>> Senior Architect
>>>> Europe Standardization & Industry Development Department
>>>> Tel: +7(985) 910-1105, +7(916) 800-5506
>>>>
>>>> _______________________________________________
>>>> rtgwg mailing list
>>>> rtgwg@xxxxxxxx
>>>> https://www.ietf.org/mailman/listinfo/rtgwg
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Alexander Azimov
>>>
>>>
>
> --
> ---
> tte@xxxxxxxxx
>
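
The interaction discussed in this thread (ECMP hashing, flow label change
on RTO, and anycast) can be sketched in a few lines of Python. This is only
a toy model under stated assumptions: the hash (SHA-256 over a text key),
the instance names, the addresses (documentation prefix 2001:db8::/32) and
both function names are made up for illustration. Real routers use their
own hash functions, and this is not how Linux derives the label.

```python
import hashlib

# Hypothetical anycast instances behind one anycast address.
ANYCAST_INSTANCES = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]


def ecmp_pick(targets, src, dst, sport, dport, flow_label=None):
    """Pick one target by hashing the flow key, as an ECMP stage might.

    flow_label=None models IPv4, or a router that ignores the flow label.
    """
    fields = [src, dst, str(sport), str(dport), "tcp"]
    if flow_label is not None:
        fields.append(f"{flow_label & 0xFFFFF:05x}")  # flow label is 20 bits
    digest = hashlib.sha256("|".join(fields).encode()).digest()
    return targets[int.from_bytes(digest[:4], "big") % len(targets)]


def instances_seen(labels, with_flow_label=True):
    """Return the set of instances one TCP session reaches across label changes."""
    flow = ("2001:db8::10", "2001:db8:ffff::1", 40000, 443)
    return {
        ecmp_pick(ANYCAST_INSTANCES, *flow,
                  flow_label=label if with_flow_label else None)
        for label in labels
    }


if __name__ == "__main__":
    labels = range(1, 33)  # in this model, each RTO assigns a new label
    print("label in hash:    ", sorted(instances_seen(labels)))
    print("label not in hash:", sorted(instances_seen(labels, with_flow_label=False)))
```

With the flow label in the hash input, the same session can land on
different instances after a label change, which is what breaks a stateful
connection to an anycast address. With the label left out (the IPv4
5-tuple case mentioned above), the choice stays fixed for the life of the
session.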