Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened on Monday, March 10. Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend the call as well as keep the conversation going in between meetings.

----->o-----
Mike discussed taking Jason's proposal in response to v4 with xarray and extending it a bit for memory reservations; it appeared to be working correctly. He's hoping to have the first implementation for that by this week.

Mike noted that the next KHO series was being prepared to be sent out before LSF/MM/BPF, including device tree.

----->o-----
Mike noted that Pratyush found that the KHO scratch area does not work well with swiotlb[1]. The scratch area is reserved before the swiotlb is initialized; the second kernel doesn't have enough low memory for the swiotlb because it's still allocated from memblock. The current scratch areas are allocated in higher memory. Mike posted a patch series that split meminit for all architectures so this should be easier to fix. This will affect any driver that requires memory in the first 4GB.

Alexander suggested allocating a scratch region in the low memory area. Pratyush proposed this as a solution, although he wondered if it would be possible to move swiotlb allocations to after the buddy allocator was up.

Heuristics were discussed to determine how much memory should be reserved in low memory for this. Mike noted that for successive kexecs, there will be multiple regions of scratch area for each NUMA node. For low memory, this would be sized to suit allocations that must originate below 4GB for DMA. Mike said a solution would still need to be developed for overlap with preserved scratch memory and Pratyush noted that this should be made explicit by denying those reservations.

Pasha asked how drivers would know if reservations would be denied in the first 4GB of memory. Mike said an error code would be returned.

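
As an illustration only (nothing like this was shown on the call), the idea of explicitly denying a preservation request that overlaps a scratch region and reporting that with an error code could be sketched as below. Every name here (struct phys_range, check_preserve_request) is hypothetical and not part of any KHO patch, and the choice of -EBUSY is an assumption; the call only established that some error code would be returned.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open physical address range [start, end). */
struct phys_range {
	uint64_t start;
	uint64_t end;
};

/* Two half-open ranges overlap when each one starts before the other ends. */
static int ranges_overlap(const struct phys_range *a,
			  const struct phys_range *b)
{
	return a->start < b->end && b->start < a->end;
}

/*
 * Hypothetical check: deny a preservation request that intersects any
 * scratch region.
 *
 * Returns 0 if the request is allowed, -EINVAL for an empty or inverted
 * range, and -EBUSY (an assumed error code) if it overlaps scratch memory.
 */
int check_preserve_request(const struct phys_range *req,
			   const struct phys_range *scratch,
			   size_t nr_scratch)
{
	size_t i;

	if (req->start >= req->end)
		return -EINVAL;

	for (i = 0; i < nr_scratch; i++) {
		if (ranges_overlap(req, &scratch[i]))
			return -EBUSY;
	}

	return 0;
}
```

A driver-side caller would test the return value and fall back (or fail its own preservation step) when the request is denied, rather than assuming that memory below 4GB can always be preserved.
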
Pasha was specific about devices that wanted to preserve the memory because they knew DMA would be on-going during the reboot. This became a more general question: what devices should we support for KHO and what should we not (what is considered too legacy?). In the meantime, Pratyush suggested explicit checks for this.

----->o-----
We shifted to talking about Pratyush's patch series supporting fdbox for memfd[2]. Reaction to this was mixed: some feedback focused on the use of miscdevice and there were security concerns. Pratyush noted that there was no intent to propose this as a generic concept outside KHO.

Pratyush noted there was no way to preserve folio orders in KHO and he also noted there was a need for page flags. He also said it would be possible to move away from miscdevice and perhaps toward VFS but would need to look more into this.

Pasha asked about how the page flags were preserved. Pratyush said there was another property that would store them currently. Pasha asked how cgroups would be handled, but there was no current support for that.

Pratyush said the current RFC focused on anon memfd and has not yet looked at hugetlb. Pasha emphasized the importance of focusing on one type of memory to start.

Pratyush noted in chat: "With FDBox work, I also realized that you can't use FDT code from modules. Should not really be a problem since we can export those symbols I suppose, but it doesn't work _currently_ at least".

----->o-----
Andrey had recently sent another patch series for KSTATE[3] that was discussed, now in v2, which was closer to being a formal submission rather than an RFC. He noted his concern with KHO was how hard it was to write serialization code. His goal was to give drivers the ability to migrate structs across kexec in a way that could be more elegant (see the struct kstate_description). He suggested this would be more maintainable. The approach had previously been used for live migration in qemu.

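
To make the "describe the struct instead of hand-writing serialization" idea concrete, here is a minimal userspace sketch. It is not the kstate API and does not reproduce the layout of struct kstate_description; all names (field_desc, state_desc, state_save, state_restore, demo_dev) are hypothetical and chosen only for illustration. The driver lists which members to carry over, and one generic routine does the copying.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical field descriptor: where a member lives and how big it is. */
struct field_desc {
	const char *name;
	size_t offset;
	size_t size;
};

/* Hypothetical description of one struct (NOT struct kstate_description). */
struct state_desc {
	const char *name;
	const struct field_desc *fields;
	size_t nr_fields;
};

#define FIELD(type, member) \
	{ #member, offsetof(type, member), sizeof(((type *)0)->member) }

/* Example driver state; only some members should survive the kexec. */
struct demo_dev {
	uint32_t irq;
	uint64_t dma_base;
	void *scratch_ptr;	/* runtime-only, deliberately not described */
};

static const struct field_desc demo_fields[] = {
	FIELD(struct demo_dev, irq),
	FIELD(struct demo_dev, dma_base),
};

const struct state_desc demo_desc = {
	.name = "demo_dev",
	.fields = demo_fields,
	.nr_fields = sizeof(demo_fields) / sizeof(demo_fields[0]),
};

/*
 * Copy every described field of obj into buf, back to back.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t state_save(const struct state_desc *d, const void *obj,
		  uint8_t *buf, size_t len)
{
	size_t pos = 0, i;

	for (i = 0; i < d->nr_fields; i++) {
		const struct field_desc *f = &d->fields[i];

		if (f->size > len - pos)
			return 0;
		memcpy(buf + pos, (const uint8_t *)obj + f->offset, f->size);
		pos += f->size;
	}
	return pos;
}

/*
 * Inverse of state_save(): fill the described fields of obj from buf.
 * Returns the number of bytes consumed, or 0 if buf is too short.
 */
size_t state_restore(const struct state_desc *d, void *obj,
		     const uint8_t *buf, size_t len)
{
	size_t pos = 0, i;

	for (i = 0; i < d->nr_fields; i++) {
		const struct field_desc *f = &d->fields[i];

		if (f->size > len - pos)
			return 0;
		memcpy((uint8_t *)obj + f->offset, buf + pos, f->size);
		pos += f->size;
	}
	return pos;
}
```

With this shape, adding a member to the preserved state is a one-line change to the field table rather than an edit to both a save and a restore routine, which is the maintainability argument made on the call.
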
Andrey noted that each description would have a version field that enables defining the minimal supported version for each driver. He made the connection between this and version control in qemu.

Pasha asked how this solves the problem when memory becomes sufficiently fragmented and the next kernel cannot boot due to it; Andrey noted the kexec would fail. Andrey suggested allocating a big contiguous area; the source and destination ranges would be the same. Mike noted that kstate_description definitions and the way drivers declare their state to preserve are independent from scratch memory reservation.

Andrey noted this wasn't a replacement for KHO but rather could be built on top of KHO. Mike suggested that on top of KHO we have FDT, then what Pasha is proposing for dynamic tree on top of that, then perhaps kstate on top of that. He would need to look more into kstate.

----->o-----
Mike asked if kstate descriptions depend on how the state is preserved on the backend; an earlier version had a migration stream. Andrey suggested using FDT underneath, but there is no strong dependency.

Pasha asked what architectures were supported today for kstate; Andrey said x86. Pasha suggested that anything that lands upstream should likely support both x86 and ARM.

Chris Li asked how kstate descriptions handle a struct that adds or removes a member. Andrey said if you want to add a new member, then you can bump the version number. He showed an example from qemu[4] that could be used as reference for this. You could also add a new kstate description with a new id; on downgrade it wouldn't be used, for backward compatibility.

Alexander suggested starting with FDT logic because it already exists and then serializing and de-serializing binary data using a UAPI. Then, we should discuss deprecating FDT if/when we have something better. That won't be problematic unless we gain hundreds of users.

Alexander emphasized we should focus on how to easily and quickly preserve memory across kexec, calling back to drivers to store their state at the right time, etc. The data format for how to serialize is a tiny detail in comparison. Pasha fully agreed with this.

----->o-----
Next meeting is PREEMPTED for LSF/MM/BPF 2025 in Montreal. So the next meeting will be on Monday, April 7 at 8am PDT (UTC-7). I'll send a reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - debrief discussions at LSF/MM/BPF 2025
 - update on Mike's patch series for memory reservation
 - update on Pratyush's progress for allocating swiotlb in low memory regions and any additional support required based on device requirements (who needs this scratch support?)
 - discuss whether the fdbox support would obsolete the need for guestmemfs in the long term
 - alignment on memblock as the first use case for KHO to justify upstreaming, including ftrace use cases
 - discuss Live Update Orchestrator (LUO) based on RFC patches sent by Pasha before then that help to define the state machine
 - discuss how KSTATE plays into KHO upstreaming and complementary or overlapping goals
 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds can be added to an fdbox
 - iommufd patch series (as well as qemu) from James
 - establishing an API for callbacks into drivers to serialize state during brownout
 - topics proposed by Pasha: reducing blackout window, relaxed serialization, and KHO activation requirements
 - implications of preserving vIOMMU state
 - testing methodology for these components, including selftests

Please let me know if you'd like to propose additional topics for discussion, thank you!

[1] https://lore.kernel.org/all/mafs0cyf4ii4k.fsf@xxxxxxxxxx
[2] https://lore.kernel.org/lkml/20250307005830.65293-5-ptyadav@xxxxxxxxx/T/
[3] https://lore.kernel.org/linux-mm/20250310120318.2124-6-arbn@xxxxxxxxxxxxxxx/T/
[4] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate