Re: REGRESSION: relocating a Debian/bullseye guest is losing network connection

On 04.11.21 12:53, Martin Grimm wrote:
> Am 02.11.21 um 10:03 schrieb Julian Wiedmann:
>> On 29.10.21 15:52, Martin Grimm wrote:
>>> Hi,
>>>
>>> I'm a colleague of Waldemar and I'd like to respond on his behalf.
>>>

[...]

>>>>> All manually added routing information will be lost anyway.
>>>>>
>>>>> And I don't even want to imagine what happens to any firewall connection
>>>>> tables or IPVS connection-tracking information in case of a Live
>>>>> Guest Relocation.
>>>>>
>>>>> So is there any kernel level solution for this you can think of?
>>>>>
>>>>
>>>> As discussed off-list, a plain "ip link set dev eth0 down" gives you
>>>> the same result. Hence I would recommend improving your configuration,
>>>> so that the needed routes are restored when the interface comes up
>>>> again.
>>>>
> Your proposed test with "ip link set dev eth0 down && ip link set dev eth0 up"
> also kills all static routing information on a standard RHEL 7.9 setup.
> So maybe it shouldn't be taken for granted that server systems with static
> network configuration recover from such outages automatically.
> 

It's taken for granted in the sense that even the old code was calling dev_close()
when setting the device offline (i.e. echo 0 > /sys/devices/qeth/x.x.xxxx/offline)
for certain config changes.

So the implicit need to preserve such user-side config was there, even if
you didn't encounter it previously.
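
For illustration only (this is not the actual qeth code, and the function name
is made up), the old set-offline path conceptually did something along these
lines, which is why routes were already being dropped there:

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

/* Hypothetical sketch: setting the device offline goes through dev_close(),
 * so the core purges routes, neighbour entries etc. exactly as it does for
 * "ip link set dev eth0 down". */
static void example_set_offline(struct net_device *dev)
{
        rtnl_lock();
        dev_close(dev);         /* admin-down: core flushes routes */
        rtnl_unlock();

        /* ... driver-specific teardown of the HW context follows ... */
}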

>>>
>>> I'd like to disagree. From my point of view the state after a "real"
>>> device outage is irrelevant regarding "Live Guest Relocation".
>>>
>>> LGR is meant to provide seamless migration of zLinux guests from
>>> one z/VM to another during production workloads.
>>> So the Linux guest has to be in exactly the same state after migration
>>> to the new z/VM as it was before. IMHO that also includes dynamic
>>> routes added e.g. by a service like keepalived, or even by hand.
>>>
>>
>> Sorry, unfortunately that doesn't match up with reality. LGR still requires
>> a full re-establishment of the HW context (i.e. you're losing whatever packets
>> are in the RX and TX queues at that moment), and then needs activity by the
>> Linux network core to establish itself in the new network environment.
>>
>> Bypassing the corresponding NETDEV_UP event etc. (as the old code did) means
>> that we e.g. don't get fresh GARPs, and traffic is then forwarded to stale
>> switch ports.
>>
>> So no, we can't go back to the mode of doing things behind the network
>> stack's back. It sort-of worked for a while, but we reached its limits.
>>
> 
> Sorry to hear that :-(
> For us as customers (our POV) that means LGR, which worked for years without
> any noticeable problem for hundreds of Linux guests and thousands of successful
> relocations, is no longer usable, or only usable with great care.

I wouldn't want to fully demotivate you. If there are ways to bypass such a "fake"
admin-down via dev_close() but still
1. close & open the device in a sane manner so that all the relevant driver-level
   callbacks are fenced off, and
2. have the network code do all the necessary work on stop & open to harmonize us
   with a potentially new network environment, stacked devices etc.
   (based on e.g. linkwatch's NETDEV_CHANGE, netdev_state_change(),
   netdev_notify_peers(), ...)

then I'd consider that a very viable solution. What we _can't_ do is re-implement
that logic in the driver itself, and then just pray that we stay current with all
subsequent changes in the stack's behaviour.
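
To illustrate that direction with a rough sketch (not the actual qeth code;
the example_* helpers are made up, only the netif_*/netdev_* calls are real
kernel APIs):

#include <linux/netdevice.h>

/* Quiesce the device without a full dev_close(): driver-level callbacks are
 * fenced off and the carrier loss is propagated through linkwatch. */
static void example_lgr_quiesce(struct net_device *dev)
{
        netif_device_detach(dev);       /* mark not-present, stop the TX queues */
        netif_carrier_off(dev);         /* linkwatch emits NETDEV_CHANGE */

        /* ... tear down the old HW context ... */
}

/* Resume once the new HW context is up: let the core re-announce us in the
 * (potentially new) network environment, stacked devices included. */
static void example_lgr_resume(struct net_device *dev)
{
        /* ... re-establish the HW context ... */

        netif_device_attach(dev);       /* re-enable the TX queues */
        netif_carrier_on(dev);          /* linkwatch: NETDEV_CHANGE to upper layers */
        netdev_notify_peers(dev);       /* fresh GARPs, so switch ports don't go stale */
}

The key difference to dev_close() would be that the core's route and address
state stays intact, while the re-announcement still happens through the
stack's own notifiers rather than behind its back.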

> But perhaps this mailing list is not the right audience for customer problems
> and we should address this via a PMR to focus on the interaction between z/VM
> and Linux.
> 
>>> Before Kernel v5 this was the observed behavior.
>>>
>>> Starting with Kernel v5, LGR now triggers a network outage that makes
>>> it unusable for many kinds of production systems.
>>> Before version 5 there were device failure and recovery messages
>>> in the kernel log, but the network configuration stayed intact.
>>>
>>> Just to be sure, I compared this with the behavior of VMware Live Migration,
>>> and there all dynamic routes stay in place, as it was with LGR
>>> before Kernel v5. Not a single error message in the kernel log there.
>>>
>>> So if the new behavior is correct for a real device outage, maybe LGR
>>> should be handled differently?
>>>
>>>>> Thanks for any advice and comments,
>>>>>
>>>>> best regards
>>>>>    Waldemar
>>>>>
>>>>
>>>
>>> Greetings
>>> Martin
>>
> 



