[Hypervisor Live Update] Notes from February 24, 2025

David Rientjes <rientjes@xxxxxxxxxx> · Sun, 2 Mar 2025 20:21:17 -0800 (PST)

Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened
on Monday, February 24.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
We discussed the KHO v5 patch series that would be posted shortly.  Two
major changes were noted: decoupling of KHO activation from kexec load
(no KHO finalize stage triggered from userspace or kexec reboot), and
more dynamic internal representation of the tree that gets serialized
into the FDT at the end of the process.  The memory preservation
mechanism did not change.

Specifically, we were discussing the elements that were proposed for
sysfs: the activation interface, FDT blob, dt max, and scratch area
definitions.

Jason noted a general concern about all of the UAPIs that are being
developed and suggested scaling them back.  Mike also noted a concern
with the activation trigger: we may want it to represent the state in the
KHO state machine.  He also agreed with Jason that dt max may want to be
removed altogether.  The suggestion would be to hard code it for
prototyping; Mike said this should not be in sysfs.

Mike noted that recent changes in KHO would do vmap and vmalloc in kexec
reboot.  Mike said that we need to have a point where state becomes
immutable and Jason suggested a simple read from sysfs, i.e. when you
read byte zero of the sysfs file the kernel would go and do the callback.
This could at least be used early in development.
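
As a rough illustration of the read-triggered idea above (not code from
the call or from the KHO series), here is a minimal userspace model in C.
All names (struct kho_model, kho_model_set(), kho_model_read()) are
hypothetical, and the "serialization" is just a memcpy standing in for
the real callback.  It only shows the shape: the first read at offset
zero runs the callback once, and after that the state is immutable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; this is not KHO code. */
struct kho_model {
	bool finalized;		/* once true, state is immutable */
	char tree[64];		/* mutable state built up before finalize */
	char blob[64];		/* serialized snapshot exposed to readers */
	size_t blob_len;
	int finalize_calls;	/* how many times the callback ran */
};

/* Stand-in for the serialization callback. */
static void kho_model_finalize(struct kho_model *m)
{
	m->blob_len = strlen(m->tree);
	memcpy(m->blob, m->tree, m->blob_len);
	m->finalized = true;
	m->finalize_calls++;
}

/* Update the mutable state; returns 0, or -1 once finalized or too long. */
static int kho_model_set(struct kho_model *m, const char *s)
{
	if (m->finalized || strlen(s) >= sizeof(m->tree))
		return -1;
	strcpy(m->tree, s);
	return 0;
}

/*
 * Read handler: a read at offset zero triggers finalize exactly once.
 * Returns the number of bytes copied into buf.
 */
static size_t kho_model_read(struct kho_model *m, char *buf, size_t len,
			     size_t off)
{
	if (off == 0 && !m->finalized)
		kho_model_finalize(m);
	if (!m->finalized || off >= m->blob_len)
		return 0;
	if (len > m->blob_len - off)
		len = m->blob_len - off;
	memcpy(buf, m->blob + off, len);
	return len;
}
```

In this model, a kho_model_set() call after the first read fails, and a
second read at offset zero returns the same snapshot without running the
callback again.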

----->o-----
Jason suggested that we agree on the FSM first because of
unpredictability with future changes, especially for UAPI.  I suggested
that we align this across stakeholders as soon as possible, especially
before going into v6 or v7 of KHO.  In this case, I was referring to both
state machines for the live update process as well as for activation.

Jason pushed to understand what the scope of the v5 that will be sent out
will be.  Is the objective to get some things merged as a foot in the
door, or is it something that is complete and does everything we intend
to do?  Mike suggested getting started with something that was minimal to
address concerns from devicetree and kexec communities.

Jason strongly suggested minimizing the UAPI in this case and relying on
the dt blob for now.  I suggested perhaps we should use debugfs for now,
where we have more flexibility for changes, instead of relying on sysfs.
Jason noted this may not be sysfs in the end anyway; we might have
different interfaces later.

Mike noted that Alex may have a strong opinion about this, but he would
suspect that this is ok for debugfs.

----->o-----
Mike discussed playing around with hugetlb; Jason suggested causing the
fds to round trip through the kexec.  If hugetlb is thrown into fdbox,
then this informs the kernel that this memory should be preserved.  This
would be similar for memfd.  Jason suggested against something like
guestmemfs and pointed in the direction of fdbox instead.

David Matlack asked what would be the point of the fdbox for things
already in a filesystem.  Jason noted this was for the concept of
creating the inode in the first place (could be an anonymous file
description), then we give it a label, and expect it to be there on the
other side of the kexec.  Instead of hugetlbfs, this might be a memfd
that supports 1GB hugepages; additionally, guest_memfd development has
been trying to untangle 1GB hugepages from guest_memfd in general.

We discussed what metadata should be preserved, independent of whether it
is regular memfd or guest_memfd.  The other side of the kexec would do
memfd_create() to restore the filesystem on the new kernel.
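
To make the restore side a bit more concrete, here is a small sketch in C
of recreating a memfd from a preserved metadata record.  The struct
memfd_meta layout and the memfd_restore() helper are assumptions for
illustration only; which metadata actually gets preserved was the open
question in the discussion.  memfd_create() and ftruncate() are the real
calls.  This only recreates an empty file of the right size; reattaching
the preserved pages is the part that would need kernel support and is not
shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical metadata record carried across the kexec. */
struct memfd_meta {
	const char *label;	/* assumed: fdbox label reused as memfd name */
	off_t size;		/* file size in bytes */
	unsigned int flags;	/* memfd_create() flags, e.g. MFD_CLOEXEC */
};

/*
 * Recreate an (empty) memfd matching the preserved metadata.
 * Returns the new fd, or -1 on failure.
 */
static int memfd_restore(const struct memfd_meta *meta)
{
	int fd = memfd_create(meta->label, meta->flags);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, meta->size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would fill in one struct memfd_meta per preserved file and call
memfd_restore() on the new kernel.  A hugepage-backed variant would pass
additional flags, which is left out here.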

Pratyush took the AI to look at implementation on memfd.  Mike took the
AI to look at memory reservation.

----->o-----
I pivoted the discussion toward fdbox which ended up never being posted
upstream.  Pratyush provided the link to the most recent code:
https://github.com/agraf/linux-2.6/blob/kvm-kho-gmem-test/drivers/misc/fdbox.c
This work likely would need to be picked up to pursue upstream because
the current code has several TODOs.  (A struct miscdevice was called out
as curious.)

It was noted that it would be difficult to support memfd without
something like fdbox, so whether the fdbox code itself is upstreamed or
something more generic comes out of the work on memfd, the base support
would need to be provided somehow.

Mike and Jason suggested designing fdbox and then starting to use it for
memfd.  This would need to be aligned by stakeholders, including the UAPI
for fdbox.  Pratyush will be looking into fdbox or inventing something
similar while working on memfd.

We need to propose the fdbox design and UAPI.

----->o-----
David Matlack suggested a future topic: testing of the live update
process, including kexec and KHO.  It will be challenging to do full
integration testing with this, so low level selftests would be strongly
preferred as this becomes more mature.

----->o-----
Next meeting is scheduled for Monday, March 10 at 8am PDT (UTC-7).  Note
the time change due to Daylight Saving Time; this is now UTC-7 instead
of UTC-8.  I'll send a reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - any objections to using debugfs as the initial interface for
   development and prototyping
 - update from Pratyush on implementation on memfd
 - update from Mike on implementation on memory reservation
 - design for fdbox and its use as a conceptual replacement for
   guestmemfs, gaining alignment within the group, and agreeing on its
   UAPI
 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds
   can be added to an fdbox
 - establishing an FSM for all of the various states that are agreed upon
   with common language (when memory mappings can happen, what is
   disallowed at certain stages)
 - extending the above topic on a separate FSM for the entire live update
   process (what happens in brownout, blackout, shutdown, etc)
 - iommufd patch series (as well as qemu) from James
 - establishing an API for callbacks into drivers to serialize state
   during brownout
 - topics proposed by Pasha: reducing blackout window, relaxed
   serialization, and KHO activation requirements
 - implications of preserving vIOMMU state
 - testing methodology for these components, including selftests

Please let me know if you'd like to propose additional topics for
discussion, thank you!