From: Dexuan Cui <decui@xxxxxxxxxxxxx> Sent: Friday, October 25, 2024 11:19 AM
>
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > Sent: Tuesday, October 22, 2024 11:04 AM
> > [...]
> > I wasn't aware of the VF handling. Where does the guest notify the
> > host that it is planning to hibernate? I went looking for such code, but
> > couldn't immediately find it. Is it in the netvsc driver? Is this the
> > sequence?
> >
> > 1) The guest notifies the host of the hibernate
> > 2) The host sends a RESCIND_CHANNELOFFER message for each VF
> >    in the VM
> > 3) The guest waits for all VF rescind processing to complete, and
> >    also must ensure that no new VFs get added in the meantime
> > 4) Then the guest proceeds with the hibernation, knowing that there
> >    are no open channels for VF devices
>
> When a hibernated VM resumes on a different host, it looks like the host team
> thinks that it's difficult to remember the VMBus device Instance GUID for the
> VF, and use the same GUID on the new host. When the new host uses a new
> Instance GUID for the VF, a Windows VM panics, and a Linux VM prints a
> warning and IIRC loses the ability to hibernate again due to a check in the
> VMBus driver.
>
> So, as a workaround, the host team decides to remove the VF(s) before
> asking the VM to hibernate. The sequence of a "host-initiated VM hibernation"
> is:
> 1) a user clicks the "Hibernation" button on the portal (or uses the equivalent
> cmd line or API).
>
> 2) Internally, the host temporarily disables AccelNet for the vNICs, i.e. sending
> PCI_EJECT and RESCIND_CHANNELOFFER for each VF.
>
> 3) The guest responds accordingly, including sending PCI_EJECTION_COMPLETE
> and CHANNELMSG_RELID_RELEASED.
>
> 4) Once the host knows that AccelNet has been disabled for the VM, the host
> sends a "please hibernate" message to the guest via the Shutdown IC.
>
> 5) The guest proceeds with the hibernation, knowing that there are no open
> channels for VF devices and assuming no new VF will be offered during the
> hibernation operation.
>
> 6) When the VM finishes hibernation and powers off, the host internally enables
> AccelNet for the VM so that when the VM resumes on a new host, the new host
> can offer a VF with a different VMBus device instance GUID.
>
> The above is for a "host-initiated VM hibernation".
>
> Currently, the Azure team doesn't support a "VM-initiated hibernation", where
> the host has no opportunity to temporarily disable AccelNet. Maybe
> "VM-initiated hibernation" can be supported when MANA-Direct is used (i.e.
> no more NetVSC NICs and there are only MANA VF NICs): in this scenario, I
> suppose the host must remember the MANA VF's VMBus device Instance GUID
> and use the same GUID on the new host.
>

Thanks for the information, Dexuan!

I'm thinking about hibernation a bit more, and perhaps will write a
Linux kernel documentation topic under Documentation/virt/hyperv
that covers the full set of scenarios. The Hyper-V interactions and
assumptions are more complex than I had realized. Getting them
formally documented should be helpful in the long run.

Michael

> > > The behavior we want is for the guest to hot remove the MLX device
> > > driver on resume, even if the MLX device was still present at suspend,
> > > so that the host does not need this special pre-hibernate behavior. This
> > > patch series may not be sufficient to ensure this, though. It just moves
> > > things in the right direction, by handling the all-offers-delivered
> > > message.
>
> I'm not sure if it's a good idea to add new code to try to remove a
> stale MLX VF since the scenario should not exist on Azure nowadays
> (currently the host temporarily disables AccelNet during hibernation so there
> should be no stale MLX VF upon resume.)
>
> On a local Hyper-V host, after a VM hibernates, we can manually disable
> AccelNet (i.e. NIC SR-IOV) for the VM, and the VM will see a stale unresponsive
> MLX VF upon resume. It would be tricky to clean up the VF gracefully:
> we would have to wait for the resume callback in the Mellanox VF driver
> to time out on the unresponsive VF (this can take 1 minute) and clean up the
> related VMBus pass-through device backing the VF; what happens if a
> host-initiated or VM-initiated hibernation is triggered during the 1 minute?
> I suspect there may be some tricky race condition issues, e.g. we may
> need to figure out how to synchronize the .resume with the .remove callbacks
> of the MLX driver.
>
> I think the general assumption is that the VM's configuration should not
> change at all across hibernation, but it looks like this assumption is found
> to be false under some conditions from time to time... I wish the assumption
> could always be true with OpenHCL.
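
For anyone following along, the ordering constraint in the host-initiated
sequence quoted above (steps 2 through 5) can be written down as a small toy
model. This is only a sketch and not kernel or host code: the class and method
names are hypothetical, and only the message names in the comments come from
the thread. It assumes the host asks for hibernation only after every VF has
been ejected and acknowledged.

```python
class HostInitiatedHibernation:
    """Toy model of the ordering constraint; all names here are hypothetical."""

    def __init__(self, vfs):
        self.open_vfs = set(vfs)   # VFs that still have an open VMBus channel
        self.pending = set()       # VFs ejected by the host, not yet acked by the guest
        self.hibernated = False

    def host_eject(self, vf):
        # Step 2: the host sends PCI_EJECT and RESCIND_CHANNELOFFER for this VF.
        if vf not in self.open_vfs:
            raise KeyError(vf)
        self.pending.add(vf)

    def guest_ack(self, vf):
        # Step 3: the guest sends PCI_EJECTION_COMPLETE and
        # CHANNELMSG_RELID_RELEASED; the VF channel is now gone.
        self.pending.remove(vf)    # raises KeyError if the host never ejected it
        self.open_vfs.remove(vf)

    def host_request_hibernate(self):
        # Step 4: the "please hibernate" request is only valid once no VF remains.
        if self.open_vfs:
            raise RuntimeError("VFs still present: " + ", ".join(sorted(self.open_vfs)))
        # Step 5: the guest hibernates with no open VF channels.
        self.hibernated = True


vm = HostInitiatedHibernation(["vf0", "vf1"])
vm.host_eject("vf0")
vm.guest_ack("vf0")
try:
    vm.host_request_hibernate()    # rejected: vf1 has not been removed yet
except RuntimeError as err:
    print(err)
vm.host_eject("vf1")
vm.guest_ack("vf1")
vm.host_request_hibernate()        # accepted: every VF was ejected and acked
print(vm.hibernated)
```

A VM-initiated hibernation corresponds to skipping host_eject() and
guest_ack() entirely, which is the case the model rejects.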