Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened on Monday, February 24. Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend the call as well as keep the conversation going in between meetings.

----->o-----
We discussed the KHO v5 patch series that would be posted shortly. Two major changes were noted: decoupling of KHO activation from kexec load (no KHO finalize stage triggered from userspace or kexec reboot), and a more dynamic internal representation of the tree that gets serialized into the FDT at the end of the process. The memory preservation mechanism did not change.

Specifically, we discussed the elements that were proposed for sysfs: the activation interface, FDT blob, dt max, and scratch area definitions. Jason noted a general concern about all of the UAPIs that are being developed and suggested scaling them back. Mike also noted a concern with the activation trigger: we may want it to represent the state in the KHO state machine. He also agreed with Jason that dt max may want to be removed altogether. The suggestion was to hard-code it for prototyping; Mike said this should not be in sysfs.

Mike noted that recent changes in KHO would do vmap and vmalloc in kexec reboot. Mike said that we need to have a point where state becomes immutable, and Jason suggested a simple read from sysfs, i.e. when you read byte zero of the sysfs file the kernel would go and do the callback. This could at least be used early in development.

----->o-----
Jason suggested that we agree on the FSM first because of unpredictability with future changes, especially for UAPI. I suggested that we align this across stakeholders as soon as possible, especially before going into v6 or v7 of KHO. In this case, I was referring to the state machines for both the live update process and activation.
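
To make the "read byte zero" idea a bit more concrete, here is a toy userspace model in Python. It was not shown on the call and is only a sketch under stated assumptions: the two state names, the PreservationModel class, and its methods are all hypothetical, and the "blob" is a plain key=value string rather than a real FDT. The only point is the ordering: state is mutable until the first read that covers byte zero, at which point the callback runs and further changes are refused.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; the group has not agreed on an FSM yet.
    MUTABLE = auto()    # preserved state may still be added or changed
    FINALIZED = auto()  # state is immutable from here on


class PreservationModel:
    """Toy model of 'reading byte zero finalizes the state'."""

    def __init__(self):
        self.state = State.MUTABLE
        self.entries = {}
        self.blob = b""

    def preserve(self, name, value):
        # Changes are only allowed before the state becomes immutable.
        if self.state is not State.MUTABLE:
            raise PermissionError("state is immutable after finalization")
        self.entries[name] = value

    def _finalize(self):
        # Stand-in for the kernel callback that serializes the tree.
        lines = [f"{k}={v}" for k, v in sorted(self.entries.items())]
        self.blob = "\n".join(lines).encode()
        self.state = State.FINALIZED

    def read(self, offset, length):
        # The first read that covers byte zero triggers the callback.
        if offset == 0 and length > 0 and self.state is State.MUTABLE:
            self._finalize()
        return self.blob[offset:offset + length]
```

In this model a caller would preserve() a few entries, then read(0, n) once to freeze the state; any later preserve() raises PermissionError. Whether activation should be a side effect of a read at all, or an explicit state transition, is exactly the kind of question the FSM discussion is meant to settle.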

Jason pushed to understand the scope of the v5 that will be sent out. Is the objective to get some things merged as a foot in the door, or is it something that is complete and does everything we intend to do? Mike suggested getting started with something minimal to address concerns from the devicetree and kexec communities. Jason strongly suggested minimizing the UAPI in this case and relying on the dt blob for now.

I suggested that we use debugfs for now, where we have more flexibility for changes, instead of relying on sysfs. Jason noted this may not be sysfs in the end anyway; we might have different interfaces later. Mike noted that Alex may have a strong opinion about this, but suspected that debugfs would be ok.

----->o-----
Mike discussed playing around with hugetlb; Jason suggested causing the fds to round trip through the kexec. If hugetlb is thrown into fdbox, then this informs the kernel that this memory should be preserved. This would be similar for memfd. Jason advised against something like guestmemfs and pointed in the direction of fdbox instead.

David Matlack asked what would be the point of the fdbox for things already in a filesystem. Jason noted this was for the concept of creating the inode in the first place (could be an anonymous file description), then we give it a label, and expect it to be there on the other side of the kexec. Instead of hugetlbfs, this might be a memfd that supports 1GB hugepages; additionally, guest_memfd development has been trying to untangle 1GB hugepages from guest_memfd in general.

We discussed what metadata should be preserved, independent of whether it is regular memfd or guest_memfd. The other side of the kexec would do memfd_create() to restore the filesystem on the new kernel. Pratyush took the action item to look at the implementation on memfd. Mike took the action item to look at memory reservation.

----->o-----
I pivoted the discussion toward fdbox, which ended up never being posted upstream.
Pratyush provided the link to the most recent code:

https://github.com/agraf/linux-2.6/blob/kvm-kho-gmem-test/drivers/misc/fdbox.c

This work would likely need to be picked up to pursue upstream because the current code has several TODOs. (A struct miscdevice was called out as curious.)

It was noted that it would be difficult to support memfd without something like fdbox, so whether the fdbox code itself is upstreamed or becomes more generic from the work on memfd, the base support would need to be provided somehow.

Mike and Jason suggested designing fdbox and then starting to use it for memfd. This would need to be aligned across stakeholders, including the UAPI for fdbox. Pratyush will be looking into fdbox or inventing something similar while working on memfd. We need to propose the fdbox design and UAPI.

----->o-----
David Matlack suggested a future topic: testing of the live update process, including kexec and KHO. It will be challenging to do full integration testing with this, so low level selftests would be strongly preferred as this becomes more mature.

----->o-----
Next meeting is scheduled for Monday, March 10 at 8am PDT (UTC-7). Note the time change due to Daylight Saving Time: this is now UTC-7 instead of UTC-8. I'll send a reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - any objections to using debugfs as the initial interface for development and prototyping

 - update from Pratyush on implementation on memfd

 - update from Mike on implementation on memory reservation

 - design for fdbox and its use as a conceptual replacement for guestmemfs, gaining alignment within the group, and agreeing on its UAPI

 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds can be added to an fdbox

 - establishing an FSM for all of the various states that are agreed upon with common language (when memory mappings can happen, what is disallowed at certain stages)

 - extending the above topic on a separate FSM for the entire live update process (what happens in brownout, blackout, shutdown, etc)

 - iommufd patch series (as well as qemu) from James

 - establishing an API for callbacks into drivers to serialize state during brownout

 - topics proposed by Pasha: reducing blackout window, relaxed serialization, and KHO activation requirements

 - implications of preserving vIOMMU state

 - testing methodology for these components, including selftests

Please let me know if you'd like to propose additional topics for discussion, thank you!