Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened on Monday, February 10. Thanks to everybody who was involved!

These notes are intended to bring people who could not attend the call up to speed, as well as to keep the conversation going in between meetings.

----->o-----
James mentioned guest memory persistence and the future of guestmemfs, including feedback to allow for more prototyping. We didn't get into this topic during the call, so we'll touch on it in the next call.

----->o-----
Mike brought up the sysfs interface, whether we want an activate file, and aligning with the devicetree feedback that was received on the last upstream posting for KHO. Every new binding would need to go through devicetree code review, and Jason noted the scalability and flexibility concerns with this.

Jason noted that older kernels can ignore newer devicetree components and everything still works, which is not the case for live update. He suggested much stronger compatibility guarantees for live update, defined between pairs of kernel versions.

Andrey agreed that we don't really need devicetree here and pointed to a patch series he had developed that doesn't rely on it. Pasha agreed that for the short/medium term it may make sense to decouple this from devicetree.

Alexander noted that FDT was chosen deliberately: we want a generic key/value store and the ability to add attributes without invalidating compatibility. Andrey noted this can be done without devicetree. There was discussion on treating this as a KHO-tree rather than strictly a devicetree. Schema validation was another attractive characteristic of FDT.

Alexander noted the current discussion has been focused on nodes and sub-nodes as a structure based on runtime data: the decision had to be made between structured data (including debuggability) and data that is always in memory and gets translated from one kernel to another. Jason thought we needed both.

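As a rough illustration of the contrast Jason drew between devicetree-style compatibility and what live update may need, here is a minimal userspace sketch. It is not taken from KHO, FDT, or any posted series: `struct kv_entry`, `known_keys`, `parse_lenient()` and `parse_strict()` are hypothetical names, and the reading that a live update receiver should refuse state it does not understand is an assumption, not something decided on the call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value entry handed from the old kernel to the new one. */
struct kv_entry {
	const char *key;
	unsigned long value;
};

/* Keys that this (hypothetical) receiving kernel knows how to consume. */
static const char *const known_keys[] = { "mem_start", "mem_size" };

static bool key_is_known(const char *key)
{
	size_t i;

	for (i = 0; i < sizeof(known_keys) / sizeof(known_keys[0]); i++)
		if (strcmp(known_keys[i], key) == 0)
			return true;
	return false;
}

/*
 * Devicetree-style policy: consume the entries we understand and
 * silently skip the rest. Returns the number of entries consumed.
 */
size_t parse_lenient(const struct kv_entry *e, size_t n)
{
	size_t i, used = 0;

	assert(e != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (key_is_known(e[i].key))
			used++;
	return used;
}

/*
 * Stricter policy (assumed here for live update): reject the whole
 * handover if any entry is not understood, so preserved state is
 * never dropped without the receiver noticing.
 */
bool parse_strict(const struct kv_entry *e, size_t n)
{
	size_t i;

	assert(e != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (!key_is_known(e[i].key))
			return false;
	return true;
}
```

With an extra, unknown key in the handover, the lenient parser carries on with the two keys it knows, while the strict parser rejects the handover outright. Which of the two behaviours is wanted, and per driver or globally, is exactly what is still under discussion.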
Pasha noted Intel in 2021 had preserved VFIO passthrough devices using PKRAM. The PKRAM patches and their interfaces turned out to be very difficult to maintain, given that PKRAM did not maintain ABIs between kernels. It relied heavily on developer insight to specify what needed to be maintained in YAML files.

Mike suggested that we'd need to give drivers the flexibility to point to an area of memory, a struct, or a scalar. James discussed serializing all inodes in guestmemfs with KHO in previous work, and this turned out to be very useful. Newer kernels were able to add new fields and move things around, but the downgrade path wasn't supported with this approach.

Jason noted there were similarities to the stable ABIs provided to userspace and by filesystems. He stressed that the complexity here may become too burdensome for driver maintainers. Jason suggested starting simple: structs pointing to structs pointing to structs. Have drivers that have versions 1, 2, etc., and allow this to become more complex when needed.

There was a general desire expressed to not maintain the kernel direct map; the virtual address space would end up getting scrambled, and that must be supported.

Alexander noted that KHO's strategy so far has been to use FDT as the standard for compatibility, and whether to use it over other solutions depends on the specific use case. We'd need to extend tooling to do validation in the future. Alexander noted that after writing 1 to the activate file, you can grab a snapshot of the device tree from sysfs, which can be used for validation.

----->o-----
We shifted to discussion on KHO v5. Mike noted that there is a sysfs interface that enables KHO, and the KHO data (devicetree and scratch description) then gets appended to kexec images. Only the scratch space would be touched by the new kernel; not all memory is preserved, although it is in the kernel direct map.

There was a discussion about using a bitmap (or an idr) to indicate what memory should be removed from the buddy allocator during early boot. Alexander noted you'd still need to be able to associate that bitmap or data structure with the specific driver that needs to find its memory. Jason noted this would be the driver's responsibility. Jason stressed how this would be used to establish the ABI: for example, if a driver does alloc_pages(), stores to that memory, and then uses to_kho(), this is a nice clean interface to preserve driver memory. Doing things like GFP_KHO would be more invasive.

----->o-----
Pasha led a discussion on the next KHO series to be sent upstream and on alignment between people in the call. Pasha suggested we don't want kexec file load to be part of the KHO process; rather, the two should be completely decoupled from each other. We want to minimize the blackout window as much as possible. If the VM is still running while doing KHO activate, we'd need to prevent any operation that changes this state, which limits VM functionality.

Alexander noted the point of the activate phase is to accelerate the kexec so that we can serialize state, the goal being to keep 99% of VM operations still possible. Pasha noted that some devices need to be preserved across kexec but others do not. Jason suggested not coupling this with a global activate state in that case, and Alexander agreed on allowing certain drivers to participate and not necessarily all.

Jason stressed that we all need to agree on the state machine given the discussion two weeks ago.

----->o-----
Next meeting is scheduled for Monday, February 24 at 8am PST (UTC-8). I'll send a reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - the future of guestmemfs and what it becomes, including alignment so prototyping can be done
 - Andrey's patch series that didn't rely on devicetree
 - alignment on not preserving the kernel direct map and using different virtual addresses in the new kernel
 - v5 of the KHO patch series with minor fixes
 - establishing an FSM for all of the various states that are agreed upon with common language (when memory mappings can happen, what is disallowed at certain stages)
 - extending the above topic on a separate FSM for the entire live update process (what happens in brownout, blackout, shutdown, etc)
 - iommufd patch series (as well as qemu) from James
 - establishing an API for callbacks into drivers to serialize state during brownout
 - topics proposed by Pasha: reducing blackout window, relaxed serialization, KHO activation requirements, and decoupling KHO from kexec
 - implications of preserving vIOMMU state

Please let me know if you'd like to propose additional topics for discussion, thank you!