Re: IPv6 Anycast has been killed by LINUX patch in 2016 - who cares?

Jeff Tantsura <jefftant.ietf@xxxxxxxxx> · Fri, 6 Aug 2021 22:13:53 -0700

It is easy to route around broken links: in a well-designed DC we can reroute (really, re-forward) within 50 ms of the failure being detected, without FRR (this is known as fast rehash).
Gray failures are a very different kind of animal. You might wonder how often optics degrade or fibers get dirty, but at large scale this is a constant problem, and it takes time (minutes if we are lucky) to detect and safely isolate.
While it might not be to everyone's taste, it is a working solution for the aforementioned problem in large DC fabrics.
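
A minimal sketch of the mechanism under discussion, assuming an ECMP node that hashes the flow label together with addresses and ports. The hash function, the addresses, and the helper names (`ecmp_next_hop`, `rehash_on_rto`) are illustrative assumptions, not what any real switch or the Linux kernel does:

```python
import hashlib

def ecmp_next_hop(src, dst, sport, dport, flow_label, n_paths):
    """Pick one of n_paths equal-cost next hops from a hash of the flow key."""
    key = f"{src}|{dst}|{sport}|{dport}|{flow_label:05x}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

def rehash_on_rto(flow_label):
    """Hypothetical stand-in for a host choosing a new 20-bit flow label after an RTO."""
    digest = hashlib.sha256(flow_label.to_bytes(3, "big")).digest()
    return int.from_bytes(digest[:3], "big") & 0xFFFFF

# Same addresses and ports throughout; only the flow label changes per "RTO".
flow = ("2001:db8::1", "2001:db8:ffff::53", 40000, 443)
label = 0x12345
paths_seen = []
for _ in range(8):
    paths_seen.append(ecmp_next_hop(*flow, label, n_paths=4))
    label = rehash_on_rto(label)

print(paths_seen)  # next-hop index chosen after each simulated RTO
```

With an unchanged label the flow stays pinned to one next hop; each new label is an independent draw among the equal-cost paths, which helps escape a gray link but can equally land on a different anycast instance.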

Cheers,
Jeff

> On Aug 6, 2021, at 18:48, Toerless Eckert <tte@xxxxxxxxx> wrote:
> 
> [bitching]
> I apologize for attempting to respond to the original post topic instead
> of derailing the thread into my pet side topic without changing subject,
> which seems to be expected behavior on ietf@xxxxxxxx.
> [/bitching]
> 
> Adding ipv6@xxxxxxxx as that seems to be the closest WG list for the topic.
> 
> Brian reminded us that we have ample RFCs to elaborate on the fact that you cannot
> reasonably expect connections to an anycast address to work when persistently
> using the anycast address.
> 
> Christian pointed out how QUIC does the right thing. Great! Maybe we should
> have an anycast support hall of fame and shame for protocols: does it or does it
> not support single round-trip resolution of anycast to unicast address?
> 
> But back to what seems to be the root cause, which isn't anycast, but IPv6
> flow label "abuse" ?!
> 
> I specifically had not heard of this Linux "hack" to change flow-label
> mid-connection after TCP RTO to overcome a seemingly broken path and hope for
> the new flow-label to pick another, working path (most likely in a data center).
> 
> I do not think that this endpoint behavior is endorsed by RFC6437, and given the absence
> of a description in RFC6437 of the circumstances under which an endpoint could or should
> change the Flow label for an in-progress transport connection, I would conclude
> that this Linux behavior is not in compliance with how the normative RFC6437
> describes proper host assignment of flow label.
> 
> Could someone who sleeps with RFC6437 under his pillow comment on whether
> or not my assessment is accurate? I seem to remember Bob mentioned he just
> carefully read through all those flow label RFCs...
> 
> Of course, pointing out to Linux that what it does is a hack would not make
> TCP to an anycast address any less of a hack.
> 
> Ultimately, I do not have a lot of sympathy for the Linux behavior, even if it
> were blessed by RFC6437, because I think good networks should fix broken paths
> fast enough for this hack not to be necessary...
> 
> Cheers
>    Toerless
> 
>> On Tue, Aug 03, 2021 at 10:45:29PM +1200, Brian Carpenter wrote:
>> The issue of anycast and unstable routes is hardly a new discovery; this
>> Linux feature is not creating a new problem. I suggest reading RFC7094 and
>> RFC4786 before continuing this conversation.
>> 
>> I certainly wouldn't design a protocol that relied on stable transport
>> connections to an anycast address.
>> 
>> Regards,
>>    Brian Carpenter
>>    (via tiny screen & keyboard)
>> 
>> On Tue, 3 Aug 2021, 22:10 Michael Tuexen, <michael.tuexen@xxxxxxxxxxxxxxxxx>
>> wrote:
>> 
>>>> On 3. Aug 2021, at 11:44, Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx> wrote:
>>>> 
>>>> Hi all,
>>>> I am writing to this alias because I do not know the proper one for this type of problem (OS/LINUX/BSD).
>>>> The history of how Alexander Azimov (Yandex) found the problem is below.
>>>> 
>>>> In short: if TCP loses connectivity for 200ms (or 1s in the SYN stage) then TCP changes the IPv6 flow label (for the active TCP session!) to push traffic to a different path.
>>>> Current networks use ECMP extensively; if intermediate nodes include the flow label in the hash calculation, there is a high probability that the path would be changed.
>>>> LINUX/BSD does not want to wait until the network fixes its problem.
>>> As far as I know, Linux implements something like what you describe, but I'm not
>>> aware of this behaviour being implemented in *BSD, at least not in FreeBSD.
>>>> 
>>>> If the final destination was anycast then the final destination would be changed too by the same hash calculation.
>>>> The stateful session would be broken as a result (see the second part of Alexander’s presentation below).
>>>> 
>>>> Since LINUX made flow label recalculation on RTO the default (2016), IPv6 Anycast has been broken.
>>>> People would have one more reason not to migrate to IPv6. The flow label does not exist in IPv4, so the OS is not capable of breaking IPv4 Anycast in a similar way.
>>>> 
>>>> Would anybody like to spend his/her karma to save IPv6 Anycast, OR let it die?
>>>> It has already been broken for 5 years and nobody has spotted it until now. Is it needed?
>>>> (I have seen a few drafts heavily dependent on IPv6 anycast)
>>>> 
>>>> What is the proper WG for such a problem?
>>> At IETF 110 Alexander gave a presentation on this in TCPM and V6OPS. See
>>> the Minutes and the corresponding slides at
>>> https://datatracker.ietf.org/meeting/110/proceedings
>>> 
>>> At least at the TCPM meeting, it was suggested that an ID would be written.
>>> 
>>> However, the behaviour you are describing is implementation-specific to
>>> Linux; it is not described or recommended by an RFC.
>>> 
>>> Best regards
>>> Michael
>>>> 
>>>> I am concerned that Anycast has been killed; it is not an easily replaceable tool.
>>>> Maybe somebody would propose something better, but if not
>>>> then LINUX should be returned to its 2015 state, when flow label change on RTO was a non-default configuration.
>>>> Such LINUX behavior could be valuable in some restricted domains (see below) when the administrator is sure that Anycast is not possible on the traffic path.
>>>> 
>>>> Eduard
>>>> From: Vasilenko Eduard
>>>> Sent: Tuesday, August 3, 2021 12:05 PM
>>>> To: 'Jeff Tantsura' <jefftant.ietf@xxxxxxxxx>; Alexander Azimov <a.e.azimov@xxxxxxxxx>
>>>> Cc: Alexander Azimov <mitradir@xxxxxxxxxxxxxx>; routing WG <rtgwg@xxxxxxxx>
>>>> Subject: RE: Self-healing Networking with Flow Label
>>>> 
>>>> Hi all,
>>>> Not many people worldwide read this alias and understand that RTO could be leveraged to fight “silent drops” in the DC environment.
>>>> It is a good use case to publish/document (with more details than there were in the presentation).
>>>> I hope that in the future OAM will be used for this purpose; it is better from an architecture point of view.
>>>> Eduard
>>>> From: Jeff Tantsura [mailto:jefftant.ietf@xxxxxxxxx]
>>>> Sent: Tuesday, August 3, 2021 1:08 AM
>>>> To: Alexander Azimov <a.e.azimov@xxxxxxxxx>
>>>> Cc: Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>; Alexander Azimov <mitradir@xxxxxxxxxxxxxx>; routing WG <rtgwg@xxxxxxxx>
>>>> Subject: Re: Self-healing Networking with Flow Label
>>>> 
>>>> Eduard,
>>>> 
>>>> The idea of the draft to come is to explain what to do, when, and how.
>>>> The goal is not to regulate (we really don’t) but to provide, similarly to RFC7938, a set of guidelines that the community can use to build better and more resilient networks.
>>>> 
>>>> Cheers,
>>>> Jeff
>>>> 
>>>> 
>>>> On Aug 2, 2021, at 04:01, Alexander Azimov <a.e.azimov@xxxxxxxxx> wrote:
>>>> 
>>>> 
>>>> Eduard,
>>>> 
>>>> Mon, 2 Aug 2021 at 13:45, Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>:
>>>> It is the key in this presentation “This behavior MUST be switched off by default”
>>>> It has been shown on slides 7-10 that flow label change on RTO is enabled by default for BSD and LINUX.
>>>> It needs regulation. It needs a standard RFC. Because it kills Anycast otherwise.
>>>> As I'm partially responsible for the key points of the presentation, I can stress that it is a bit different.
>>>>      • We have an opportunity for self-healing TCP on top of IPv6, it should be preserved;
>>>>      • The Linux defaults should be changed to a safe mode to prevent session timeouts;
>>>>      • The hash recalculation behavior should be documented;
>>>> I'm not sure what you mean by the term 'regulation'.
>>>> 
>>>> The story of how to use RTO to work around vendors’ “silent drop” bugs could be a good informational RFC.
>>>> Maybe people developing iOAM would pay more attention to this use case.
>>>> 
>>>> IMHO: these are 2 separate drafts.
>>>> I'm not sure about it, we'll try to provide -00 before the next IETF meeting, let's see how it progresses.
>>>> 
>>>> Eduard
>>>> From: Alexander Azimov [mailto:mitradir@xxxxxxxxxxxxxx]
>>>> Sent: Monday, August 2, 2021 1:20 PM
>>>> To: Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>; Jeff Tantsura <jefftant.ietf@xxxxxxxxx>
>>>> Cc: routing WG <rtgwg@xxxxxxxx>
>>>> Subject: Re: Self-healing Networking with Flow Label
>>>> 
>>>> Eduard,
>>>> 
>>>> Please see the quote from the slide 28. My suggestion was:
>>>> 
>>>> Client – sends SYN, Server – responds with SYN&ACK
>>>>      • In case of SYN_RTO or RTO events Server SHOULD recalculate its TCP socket hash, thus change Flow Label. This behavior MAY be switched on by default;
>>>>      • In case of SYN_RTO or RTO events Client MAY recalculate its TCP socket hash, thus change Flow Label. This behavior MUST be switched off by default;
>>>> This looks like a safe default behavior that preserves part of the improvements but makes working with stateful anycast services safe.
>>>> 
>>>> And yes, IMO it's ok to have a knob to enable it in a controlled environment. If you ask how to enable it in the presence of internal anycast services, there was also a suggestion in the slides: eBPF. It gives a good way to make this kind of separation.
>>>> 
>>>> 02.08.2021, 11:48, "Vasilenko Eduard" <vasilenko.eduard@xxxxxxxxxx>:
>>>> Hi Jeff,
>>>> The situation where the Control Plane does not understand what the Forwarding Plane is doing is a bug.
>>>> Yes, RTO in TCP helps to find a work-around for this bug. And yes, Anycast is typically absent inside the DC, so it does not create a problem in the DC environment.
>>>> 
>>>> But the same LINUX is used outside the DC. RTO Flow Label change here would create even more problems if Anycast happened to be on the traffic path (which is not very predictable for the client).
>>>> Do we need one LINUX distribution for the DC and a separate distribution for other environments?
>>>> Or should we rely on the proper non-default configuration for different environments? (The admin should not forget to change it.)
>>>> What if Anycast became needed in the DC?
>>>> 
>>>> RTO flow label recalculation is mutually exclusive with Anycast on the traffic path.
>>>> What is more valuable for the public?
>>>> 
>>>> IMHO: It is better to fight this type of bug with iOAM than to cancel Anycast.
>>>> 
>>>> IMHO: It is better to have Flow Label recalculation on RTO “off” by default.
>>>> A DC admin is better qualified to activate it in a controlled environment than every client worldwide, who would otherwise have to remember to disable it.
>>>> 
>>>> Eduard
>>>> From: Jeff Tantsura [mailto:jefftant.ietf@xxxxxxxxx]
>>>> Sent: Monday, August 2, 2021 6:56 AM
>>>> To: Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx>
>>>> Cc: mitradir@xxxxxxxxxxxxxx; routing WG <rtgwg@xxxxxxxx>
>>>> Subject: Re: Self-healing Networking with Flow Label
>>>> 
>>>> Eduard,
>>>> 
>>>> The issue is not present in the link/device failure case: if well implemented, fast rehash takes care of updating forwarding within a number of ms. The problem is with “gray” failures, when the link in question is UP from a routing/forwarding perspective but drops traffic (mostly occasionally, and when a HW bug occurs it has distinct flow attributes).
>>>> 
>>>> In many large DC fabrics, the majority of the traffic is east-west, between end-points that aren’t anycast. In such deployments the solution solves the issues rather elegantly and without any intervention from the operator.
>>>> The issues/side effects are well understood and will be documented.
>>>> 
>>>> The best way to receive RTGWG emails is well… to subscribe to RTGWG ;-)
>>>> 
>>>> Cheers,
>>>> Jeff
>>>> 
>>>> 
>>>> On Aug 1, 2021, at 09:47, Vasilenko Eduard <vasilenko.eduard@xxxxxxxxxx> wrote:
>>>> 
>>>> 
>>>> Hi  Alexander,
>>>> 
>>>> Have I understood your presentation right?
>>>> The client SHOULD change the IPv6 flow label after SYN RTO to have a chance to be moved to a working path inside the DC fabric (if the DC fabric uses the flow label for hash calculation)
>>>> But at the same time
>>>> The client SHOULD NOT change the IPv6 flow label after SYN RTO to avoid being switched to a different TCP proxy engine.
>>>> 
>>>> Looks like a deadlock, especially if both things should happen for the same traffic:
>>>> it should reach the DC fabric
>>>> and
>>>> it should be hash load-balanced between different TCP proxy engines (or applications) inside the DC fabric.
>>>> 
>>>> I see one bad solution (“Disable Flow Label”):
>>>> Routers up to the TCP proxy engine SHOULD be configured not to use the flow label (by the way, these are all the routers on the Internet),
>>>> TCP flow engines SHOULD be outside of the DC Fabric (CLOS), probably in front of it.
>>>> Routers/Switches inside DC Fabric SHOULD use flow labels.
>>>> 
>>>> I see another bad solution (“Disable Anycast”):
>>>> Disable anycast on routers in principle, use only stateful LB.
>>>> 
>>>> 
>>>> It has been commented in the chat that Anycast is not possible in principle for stateful connections. That is too general a statement.
>>>> Anycast is just not compatible with Flow Label. It is not a problem for IPv4 anycast even if the connection is stateful (TCP) because the 5-tuple used for the hash would not change.
>>>> Hence, IPv6 anycast became dead at the time when Flow Label change was added in LINUX for active TCP sessions.
>>>> 
>>>> Among 3 things:
>>>> -          Anycast
>>>> -          Flow Label load balancing (basic Flow Label functionality)
>>>> -          Flow Label change on the active session for the application to be more active in new path search
>>>> You have to choose which one to kill; all 3 are not compatible with each other at the same time.
>>>> I vote to disable Flow Label change in LINUX. Then wait until the network fixes itself.
>>>> We have so many fancy TE tools these days. A broken link or a broken node could be excluded from routing within 50ms.
>>>> 
>>>> PS: I am not subscribed to the RTGWG alias, please keep me on copy in this thread.
>>>> Best Regards
>>>> Eduard Vasilenko
>>>> Senior Architect
>>>> Europe Standardization & Industry Development Department
>>>> Tel: +7(985) 910-1105, +7(916) 800-5506
>>>> 
>>>> _______________________________________________
>>>> rtgwg mailing list
>>>> rtgwg@xxxxxxxx
>>>> https://www.ietf.org/mailman/listinfo/rtgwg
>>>> 
>>>> 
>>>> --
>>>> Best regards,
>>>> Alexander Azimov
>>>> 
>>> 
>>> 
> 
> -- 
> ---
> tte@xxxxxxxxx
>