[Hypervisor Live Update] Notes from January 27, 2025

David Rientjes <rientjes@xxxxxxxxxx> · Mon, 3 Feb 2025 20:00:10 -0800 (PST)

Hi everybody,

Here are the notes from the inaugural Hypervisor Live Update call that
happened on Monday, January 27.  Thanks to everybody who was involved!

----->o-----
I talked about the logistics and goals of the biweekly.  If you would
like to be added to the calendar invite, please email me privately.  I
can also share our cover letter and shared drive with you that contains
recordings and slides (if any) if you provide an email address associated
with a Google account.

We also discussed the scope of the biweekly series to include:
 - KHO
   + Including potential early adopters (hugetlbfs, tmpfs)
 - Persistence of PCIe devices
   + IOMMU(fd) persistence
 - Guest memory
   + Including Confidential Computing use cases
 - Reboot optimizations

----->o-----
Mike is planning on sending out another KHO patch series, likely this
week (v4).

----->o-----
James discussed work last year on iommufd persistence and hooking iommu
drivers into KHO to persist their state across kexec.  Feedback
suggested setting up new page tables and then transferring over after the
kexec is completed.  James will start implementing this on top of his
existing patch series and include qemu changes as well.  To minimize
downtime, the plan was to resume the VM with the old page tables.  He
suggested userspace would initiate the switch to the new page tables.

Jason noted KHO has been too focused on preserving memory and needs to
preserve file descriptors as well: we need to take iommufd, freeze it,
give it to KHO, and then pick it back up after kexec.  Once you're done
with it after the handover, like an atomic attach, it gets destroyed.
Jason also noted we'll have to consider preserving vIOMMU state to
support the latest NV hardware, which is highly complex.
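
As a rough illustration of the lifecycle Jason described (freeze, hand to
KHO, pick back up after kexec, destroy once the handover is done), here is
a minimal Python sketch.  The state names, the transition table, and the
advance() helper are all hypothetical and are not taken from any patch
series; the agreed-upon state machine is still to be defined.

```python
from enum import Enum, auto


class FdState(Enum):
    """Hypothetical lifecycle states for a preserved file descriptor."""
    LIVE = auto()         # in normal use before the live update
    FROZEN = auto()       # frozen, no further changes accepted
    HANDED_OVER = auto()  # given to KHO, waiting for kexec
    RESTORED = auto()     # picked back up after kexec
    DESTROYED = auto()    # preserved copy released after the handover


# Allowed transitions (assumed for illustration only).
TRANSITIONS = {
    FdState.LIVE: {FdState.FROZEN},
    FdState.FROZEN: {FdState.HANDED_OVER},
    FdState.HANDED_OVER: {FdState.RESTORED},
    FdState.RESTORED: {FdState.DESTROYED},
    FdState.DESTROYED: set(),
}


def advance(current, target):
    """Return target if the transition is allowed, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(
            f"illegal transition {current.name} -> {target.name}")
    return target
```

Writing the transitions down as a table like this is one way to make the
definitions explicit enough to review, whatever the final states end up
being.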

Alexander Graf previously developed a concept called fdbox that turned
out to be very intrusive in the kernel.  Jason noted that all of this
work will be invasive, but we should prefer to compartmentalize it as
much as possible (like for iommufd stuff, a kho.c).

----->o-----
Pasha suggested KHO should be kept as a mechanism to preserve kernel
memory across kexec; serialization requires different mechanisms.  He
plans to propose a separate API for callbacks into drivers.

Jason noted it was going to be critical to provide a state machine that
we all agree on, including for definitions.  One aspect he would like to
align on is whether you could put a guest_memfd into an fdbox or even a
tmpfs into an fdbox.  Mike Rapoport noted there are multiple layers here,
where KHO is very low level and fdbox is built on top of it.  Mike
emphasized it will be critical to establish a standardized format between
multiple kernel versions.

----->o-----
There was lots of discussion on stable ABIs for allowing continuous
upgrading of kernels without requiring a reboot.  Jason suggested
upstream can provide a mechanism for upgrading from 6.12 -> 6.13 but not
from 6.12 -> 6.16, as an example.  Doing any version -> any version is
much harder and likely cannot be supported, at least in the short term,
because it's so invasive.  A good example would be the mlx driver.

James and Alexander noted that we must be able to roll back and that this
can be enabled by the downstream customer; it may not be a burden on the
upstream kernel.  There was a general acknowledgment that upstream pairs
must be supported, but much of this could become the responsibility of
the downstream user.  (Alexander noted some users may care about mlx,
others may not, for example.)
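
To make the "pairs" idea concrete, here is a minimal Python sketch,
assuming upstream supports only an upgrade to the next minor release and a
downstream user can add further pairs (including rollback pairs) that they
have validated themselves.  The function names and the policy are
hypothetical; nothing was decided on the call.

```python
def parse_version(text):
    """Turn a 'major.minor' string such as '6.12' into a tuple of ints."""
    major, minor = text.split(".")
    return int(major), int(minor)


def upstream_supported(old, new):
    """Assumed upstream policy: only old -> next minor release.

    Simplification: a change of major version is never treated as adjacent.
    """
    old_major, old_minor = parse_version(old)
    new_major, new_minor = parse_version(new)
    return old_major == new_major and new_minor == old_minor + 1


def upgrade_allowed(old, new, downstream_pairs=frozenset()):
    """Allow upstream pairs plus any (old, new) pairs validated downstream."""
    return upstream_supported(old, new) or (old, new) in downstream_pairs
```

For example, upgrade_allowed("6.12", "6.13") holds under the assumed
upstream policy, upgrade_allowed("6.12", "6.16") does not, and a rollback
such as 6.13 -> 6.12 is only allowed if the downstream user lists
("6.13", "6.12") in downstream_pairs.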

David Woodhouse inquired about rollback functionality and how we would
support a VM that has deserialized after kexec using a new feature and
then still support a downgrade afterwards.  Alexander said it was
important that the user of KHO supports very controlled A->B environments
for this to work properly, and, if provided, they can control downgrade
paths as well.

Dave Hansen noted this was similar to discussions about checkpoint/restart
and CRIU.  The burden in this case may be very similar, in that it is
taken on by those who care about upgrading from one version to another
and is not a general upstream requirement.  It was acknowledged that
this will be a ton of work to maintain reliably, however.  Dave noted it
will be important to socialize the work that needs to be done with
upstream developers, but that the work will be taken on by those who care
to use KHO.

It was agreed that once you roll out, you enable new features only
when you are confident there will not be a rollback, and then once the
feature is enabled you've passed the point of no return.
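
The point of no return can be sketched as a simple check: rollback stays
possible only while everything in use is something the old kernel can
still understand.  The Rollout class and the feature names in this
minimal Python sketch are hypothetical and only model the rule above.

```python
class Rollout:
    """Tracks which features the preserved state depends on (hypothetical)."""

    def __init__(self, old_kernel_features):
        # Features the previous kernel can deserialize (assumed to be known).
        self.old_kernel_features = frozenset(old_kernel_features)
        self.features_in_use = set()

    def enable(self, feature):
        """Start using a feature; preserved state now depends on it."""
        self.features_in_use.add(feature)

    def can_roll_back(self):
        """True only if the old kernel understands every feature in use."""
        return self.features_in_use <= self.old_kernel_features
```

With Rollout({"base"}), can_roll_back() stays true after enable("base")
and becomes false as soon as a feature the old kernel does not know is
enabled.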

----->o-----
Jason provided a nice early milestone for KHO work: demonstrate a kexec
in which the VM survives and the VFIO device attached to it survives.
Pasha noted this has been done before with PKRAM, but now needs to be
done in a way that KHO would support.

----->o-----
Next meeting is scheduled for Monday, February 10 at 8am PST (UTC-8).
I'll send a reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - LSF/MM/BPF topics of interest for the group
 - v4 of the KHO patch series sent out by Mike Rapoport
 - iommufd patch series (as well as qemu) sent out (hopefully) by James
   Gowans this week, otherwise a week or two from now
 - establishing an API for callbacks into drivers to serialize state
 - establishing an FSM for all of the various states that are agreed upon
   with common language
 - finalizing the decision on upstream support for minor version upgrades
   across KHO and the burden of downstream users to define what versions
   can be upgraded
 - topics proposed by Pasha: reducing blackout window, relaxed
   serialization, KHO activation requirements, and decoupling KHO from
   kexec
 - implications of preserving vIOMMU state

Please let me know if you'd like to propose additional topics for
discussion, thank you!