From: Mark Rutland <mark.rutland@xxxxxxx> Sent: Friday, May 6, 2022 4:01 AM
>
> On Thu, May 05, 2022 at 04:51:54PM +0200, Vitaly Kuznetsov wrote:
> > Mark Rutland <mark.rutland@xxxxxxx> writes:
> >
> > > On Thu, May 05, 2022 at 03:52:24PM +0200, Vitaly Kuznetsov wrote:
> > >> "Guilherme G. Piccoli" <gpiccoli@xxxxxxxxxx> writes:
> > >>
> > >> > On 05/05/2022 09:53, Mark Rutland wrote:
> > >> >> [...]
> > >> >> Looking at those, the cleanup work is all arch-specific. What exactly would we
> > >> >> need to do on arm64, and why does it need to happen at that point specifically?
> > >> >> On arm64 we don't expect as much paravirtualization as on x86, so it's not
> > >> >> clear to me whether we need anything at all.
> > >> >>
> > >> >>> Anyway, the idea here was to gather a feedback on how "receptive" arm64
> > >> >>> community would be to allow such customization, appreciated your feedback =)
> > >> >>
> > >> >> ... and are you trying to do this for Hyper-V or just using that as an example?
> > >> >>
> > >> >> I think we're not going to be very receptive without a more concrete example of
> > >> >> what you want.
> > >> >>
> > >> >> What exactly do *you* need, and *why*? Is that for Hyper-V or another hypervisor?
> > >> >>
> > >> >> Thanks
> > >> >> Mark.
> > >> >
> > >> > Hi Mark, my plan would be doing that for Hyper-V - kind of the same
> > >> > code, almost. For example, in hv_crash_handler() there is a stimer
> > >> > clean-up and the vmbus unload - my understanding is that this same code
> > >> > would need to run in arm64. Michael Kelley is CCed, he was discussing
> > >> > with me in the panic notifiers thread and may elaborate more on the needs.
> > >> >
> > >> > But also (not related with my specific plan), I've seen KVM quiesce code
> > >> > on x86 as well [see kvm_crash_shutdown() on arch/x86] , I'm not sure if
> > >> > this is necessary for arm64 or if this already executing in some
> > >> > abstracted form, I didn't dig deep - probably Vitaly is aware of that,
> > >> > hence I've CCed him here.
> > >>
> > >> Speaking about the difference between reboot notifiers call chain and
> > >> machine_ops.crash_shutdown for KVM/x86, the main difference is that
> > >> reboot notifier is called on some CPU while the VM is fully functional,
> > >> this way we may e.g. still use IPIs (see kvm_pv_reboot_notify() doing
> > >> on_each_cpu()). When we're in a crash situation,
> > >> machine_ops.crash_shutdown is called on the CPU which crashed. We can't
> > >> count on IPIs still being functional so we do the very basic minimum so
> > >> *this* CPU can boot kdump kernel. There's no guarantee other CPUs can
> > >> still boot but normally we do kdump with 'nprocs=1'.
> > >
> > > Sure; IIUC the IPI problem doesn't apply to arm64, though, since that doesn't
> > > use a PV mechanism (and practically speaking will either be GICv2 or GICv3).
> > >
> >
> > This isn't really about PV: when the kernel is crashing, you have no
> > idea what's going on on other CPUs, they may be crashing too, locked in
> > a tight loop, ... so sending an IPI there to do some work and expecting
> > it to report back is dangerous.
>
> Sorry, I misunderstood what you meant about IPIs. I thought you meant that some
> enlightened IPI mechanism might be broken, rather than you simply cannot rely
> on secondary CPUs to do anything (which is true regardless of whether the
> kernel is running under a hypervisor).
>
> So I understand not calling all the regular reboot notifiers in case they do
> something like that, but it seems like we should be able to do that with a
> panic notifier, since that *should* follow the principle that you can't
> rely on a working IPI.
>
> [...]
>
> > >> There's a crash_kexec_post_notifiers mechanism which can be used instead
> > >> but it's disabled by default so using machine_ops.crash_shutdown is
> > >> better.
> > >
> > > Another option is to defer this to the kdump kernel. On arm64 at least, we know
> > > if we're in a kdump kernel early on, and can reset some state based upon that.
> > >
> > > Looking at x86's hyperv_cleanup(), everything relevant to arm64 can be deferred
> > > to just before the kdump kernel detects and initializes anything relating to
> > > hyperv. So AFAICT we could have hyperv_init() check is_kdump_kernel() prior to
> > > the first hypercall, and do the cleanup/reset there.
> >
> > In theory yes, it is possible to try sending CHANNELMSG_UNLOAD on kdump
> > kernel boot and not upon crash, I don't remember if this approach was
> > tried in the past.
> >
> > > Maybe we need more data for the vmbus bits? ... if so it seems that could blow
> > > up anyway when the first kernel was tearing down.
> >
> > Not sure I understood what you mean... From what I remember, there were
> > issues with CHANNELMSG_UNLOAD handling on the Hyper-V host side in the
> > past (it was taking *minutes* for the host to reply) but this is
> > orthogonal to the fact that we need to do this cleanup so kdump kernel
> > is able to connect to Vmbus devices again.
>
> I was thinking that if it was necessary to have some context (e.g. pointers to
> buffers which are active) in order to do the teardown, it might be painful to
> do that in the kdump kernel itself.
>
> Otherwise, I think doing the teardown in the kdump kernel itself would be
> preferable, since there's a greater likelihood that kernel infrastructure will
> work relative to doing that in the kernel which crashed, and it gives the kdump
> kernel the option to detect when something cannot be torn down, and not use
> that feature.
>

Apologies for the delay in joining this thread. In addition to being out on
vacation, I've been doing some further investigation to make sure I have my
info right.

The idea of doing the VMbus teardown in the kdump kernel itself is intriguing,
but has its own problems. Sending the CHANNELMSG_UNLOAD in the kdump kernel
should work OK. But Hyper-V will ack the command, and the ack comes back into
a queue in the original kernel's memory. We can't re-initiate the VMbus
connection in the kdump kernel until we have the ack. We don't need any data
from the ack, so we *could* just wait 100 seconds and assume the ack has come
in (in unusual cases, the ack really can take that long for reasons documented
in the code).

Given a choice of doing the VMbus teardown in the kdump kernel (including
waiting an extra 100 seconds) vs. doing the teardown in a panic notifier in
the original kernel, I think the panic notifier approach is preferable. The
risk of a failure that prevents kdump from working seems only very slightly
higher when the teardown is done in the original kernel.

The bigger problem is with a normal kexec(). On x86/x64, we depend on
machine_ops.shutdown() to run the code to do the VMbus teardown. Today, the
teardown isn't happening at all on ARM64, leaving kexec() at risk of a variety
of failures. kexec() shuts down all devices, so individual VMbus synthetic
devices get properly shut down. But VMbus is a bus, not a device, and the
VMbus connection is managed by the VMbus bus driver. From what I can see,
there's no mechanism to explicitly shut down buses upon kexec().

Just brainstorming, I'm wondering if we could create a dummy VMbus device that
would tear down the VMbus connection in the case of a kexec(). Kexec() appears
to explicitly shut down devices in reverse order, which would work since the
dummy device can be created before any of the other synthetic VMbus devices
show up.

I'm open to other ideas as well. I understand the desire not to open
floodgates by adding the equivalent of machine_ops on the ARM64 side.

Michael
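
To make the ordering argument in the dummy-device idea concrete, here is a
minimal userspace model in C. It is only a sketch: it assumes that shutdown
walks devices in reverse order of creation, as described above, and every name
in it (model_dev, model_register, model_shutdown_all, the device names) is
hypothetical and not a kernel or Hyper-V API. The dummy connection device is
registered first, so its shutdown callback runs last, after the synthetic
devices have been shut down.

```c
/* Userspace model only; none of these names are real kernel or Hyper-V APIs. */
#include <assert.h>
#include <string.h>

#define MAX_DEVS 8

struct model_dev {
	const char *name;
	void (*shutdown)(struct model_dev *dev);
};

static struct model_dev *dev_list[MAX_DEVS];	/* kept in creation order */
static int dev_count;

static const char *shutdown_log[MAX_DEVS];	/* order in which callbacks ran */
static int shutdown_count;

void model_register(struct model_dev *dev)
{
	assert(dev_count < MAX_DEVS);
	dev_list[dev_count++] = dev;
}

/* Assumption: devices are shut down newest-first (reverse creation order). */
void model_shutdown_all(void)
{
	for (int i = dev_count - 1; i >= 0; i--)
		if (dev_list[i]->shutdown)
			dev_list[i]->shutdown(dev_list[i]);
}

/* Record that a device's shutdown callback ran. */
void model_log_shutdown(struct model_dev *dev)
{
	assert(shutdown_count < MAX_DEVS);
	shutdown_log[shutdown_count++] = dev->name;
}

/*
 * Stand-in for the VMbus connection teardown. A real version would send
 * CHANNELMSG_UNLOAD and wait for the ack; here we only record the call.
 */
void model_conn_shutdown(struct model_dev *dev)
{
	model_log_shutdown(dev);
}

/* Position at which the named device was shut down, or -1 if it never was. */
int model_shutdown_position(const char *name)
{
	for (int i = 0; i < shutdown_count; i++)
		if (strcmp(shutdown_log[i], name) == 0)
			return i;
	return -1;
}

static struct model_dev vmbus_conn = { "vmbus-conn", model_conn_shutdown };
static struct model_dev synth_net = { "synthetic-net", model_log_shutdown };
static struct model_dev synth_storage = { "synthetic-storage", model_log_shutdown };

/* Create the dummy device first, then the synthetic devices, then shut down. */
void model_run(void)
{
	dev_count = 0;
	shutdown_count = 0;
	model_register(&vmbus_conn);
	model_register(&synth_net);
	model_register(&synth_storage);
	model_shutdown_all();
}
```

Calling model_run() from a main() and then inspecting shutdown_log shows the
dummy connection device being shut down after both synthetic devices. Whether
the real kexec() path gives the same guarantee would need to be checked
against the driver core.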