On 04.11.21 12:53, Martin Grimm wrote:
> On 02.11.21 at 10:03, Julian Wiedmann wrote:
>> On 29.10.21 15:52, Martin Grimm wrote:
>>> Hi,
>>>
>>> I'm a colleague of Waldemar and I'd like to respond on his behalf.
>>> [...]
>>>>> All manually added routing information will be lost anyway.
>>>>>
>>>>> And I can't even imagine what happens to any firewall connection
>>>>> tables or ipvs connection tracking information in case of a Live
>>>>> Guest Relocation.
>>>>>
>>>>> So is there any kernel level solution for this you can think of?
>>>>>
>>>>
>>>> As discussed off-list, a plain "ip link set dev eth0 down" gives you
>>>> the same result. Hence I would recommend improving your configuration
>>>> so that the needed routes are restored when the interface comes up
>>>> again.
>>>>
> Your proposed test with "ip link set dev eth0 down && ip link set dev eth0 up"
> also kills all static routing information on a regular RHEL 7.9 setup.
> So maybe it shouldn't be taken for granted that server systems with static
> network configuration recover from such outages automatically.
>

It's taken for granted in the sense that even the old code was calling
dev_close() when setting the device offline (ie. echo 0 >
/sys/devices/qeth/x.x.xxxx/offline) for certain config changes. So the
implicit need to preserve such user-side config was already there, even if
you hadn't encountered it before.

>>>
>>> I'd like to disagree. From my point of view the state after a "real"
>>> device outage is irrelevant regarding "Live Guest Relocation".
>>>
>>> LGR is meant to provide seamless migration of zLinux guests from
>>> one z/VM to the other during production workloads.
>>> So the linux guest has to be in exactly the same state after migration
>>> to the new z/VM as it was before. That also includes IMHO dynamic
>>> routes added e.g. by a service like keepalived or even by hand.
>>>
>>
>> Sorry, unfortunately that doesn't match up with reality. LGR still requires
>> a full re-establishment of the HW context (ie. you're losing whatever packets
>> are in the RX and TX queues at that moment), and then needs activity by the
>> Linux network core to establish itself in the new network environment.
>>
>> Bypassing the corresponding NETDEV_UP event etc (as the old code did) means
>> that we eg. don't get fresh GARPs, and traffic is then forwarded to stale
>> switch ports.
>>
>> So no, we can't go back to the mode of doing things behind the network
>> stack's back. It sort-of worked for a while, but we reached its limits.
>>
>
> Sorry to hear that :-(
> For us as customers (our POV) that means that LGR, which worked for years
> without any noticeable problem for hundreds of linux guests and thousands
> of successful relocations, isn't usable anymore, or only with great care.

I wouldn't want to fully demotivate you. If there are ways to bypass such a
"fake" admin-down via dev_close(), but still

1. close & open the device in a sane manner, so that all the relevant
   driver-level callbacks are fenced off, and
2. have the network code do all the necessary work on stop & open to
   harmonize us with a potentially new network environment, stacked devices
   etc (based on eg. linkwatch's NETDEV_CHANGE, netdev_state_change(),
   netdev_notify_peers(), ...),

then I'd consider that a very viable solution. What we _can't_ do is
re-implement that logic in the driver itself, and then just pray that we
stay current with all subsequent changes in the stack's behaviour.
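
For illustration only, here is a minimal sketch of what such a path could
look like, built purely from standard net core helpers (netif_device_detach()
/ netif_device_attach(), netif_carrier_off() / netif_carrier_on() for
linkwatch, and netdev_notify_peers()). The function name and its placement
are made up for this sketch and are not the actual qeth code; whether this
covers everything that dev_close() / dev_open() would trigger in the stack
is exactly the open question.

#include <linux/netdevice.h>

/* Hypothetical recovery hook, NOT the actual qeth implementation:
 * quiesce the interface without a full admin-down via dev_close(),
 * then let the net core re-announce the device in the (potentially
 * new) network environment.
 */
static void example_lgr_recover(struct net_device *dev)
{
        /* 1. Fence off driver-level activity while the HW context is
         *    torn down: mark the device as not present (this stops the
         *    TX queues) and drop carrier, so linkwatch raises a
         *    NETDEV_CHANGE event for stacked devices.
         */
        netif_device_detach(dev);
        netif_carrier_off(dev);

        /* ... tear down and re-establish the HW context here ... */

        /* 2. Let the core harmonize us with the new environment:
         *    carrier-up triggers another NETDEV_CHANGE via linkwatch,
         *    and netdev_notify_peers() makes the stack send gratuitous
         *    ARPs / unsolicited NAs so that switch ports don't keep
         *    forwarding to stale locations.
         */
        netif_device_attach(dev);
        netif_carrier_on(dev);
        netdev_notify_peers(dev);
}

The intent of such a shape would be to avoid the admin-down (and thus the
route flush) entirely, while still giving the stack its NETDEV_CHANGE /
NOTIFY_PEERS events; whether that is sufficient for IPv6, bonding and other
stacked setups is the part that would need verification.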
> But perhaps this mailing list is not the right audience for customer problems
> and we should address this via a PMR to focus on the interaction between z/VM
> and Linux.
>
>>> Before Kernel v5 this was the observed behavior.
>>>
>>> Starting with Kernel v5, LGR now triggers a network outage that makes
>>> it unusable for many kinds of production systems.
>>> Before version 5 there were device failure and recovery messages
>>> in the kernel log, but the network configuration stayed intact.
>>>
>>> Just to be sure I compared this with the behavior of VMware Live Migration,
>>> and there all dynamic routes stay in place, as it was with LGR
>>> before Kernel v5. Not a single error message in the kernel log there.
>>>
>>> So if the new behavior is correct for a real device outage, maybe LGR
>>> should be handled differently?
>>>
>>>>> Thanks for any advice and comments,
>>>>>
>>>>> best regards
>>>>> Waldemar
>>>>>
>>>>
>>>
>>> Greetings
>>> Martin
>>
>