[Hypervisor Live Update] Notes from March 10, 2025

Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened
on Monday, March 10.  Thanks to everybody who was involved!

These notes are intended to bring people who could not attend the call up
to speed, as well as keep the conversation going in between meetings.

----->o-----
Mike discussed taking Jason's proposal in response to v4 with xarray and
extending it a bit for memory reservations; it appeared to be working
correctly.  He's hoping to have the first implementation of that this
week.

Mike noted that the next KHO series was being prepared to be sent out
before LSF/MM/BPF, including device tree.

----->o-----
Mike noted that Pratyush found that the KHO scratch area does not work
well with swiotlb[1].  The scratch area is reserved before the swiotlb is
initialized; the second kernel doesn't have enough low memory for the
swiotlb because the swiotlb is still allocated from memblock and the
current scratch areas are allocated in higher memory.  Mike posted a patch
series that splits meminit for all architectures, so this should be easier
to fix.  The problem affects any driver that requires memory in the first
4GB.

Alexander suggested allocating a scratch region in the low memory area.
Pratyush proposed this as a solution, although he wondered if it would be
possible to move swiotlb allocations to after the buddy allocator was up.
Heuristics were discussed to determine how much memory should be reserved
in low memory for this.

Mike noted that for successive kexecs, there will be multiple regions of
scratch area for each NUMA node.  For low memory, the region would be
sized suitably for allocations that must originate below 4GB for DMA.
Mike said a solution would still need to be developed for overlap with
preserved scratch memory, and Pratyush noted that this should be made
explicit by denying those reservations.

Pasha asked how drivers would know if reservations would be denied in the
first 4GB of memory.  Mike said an error code would be returned.  Pasha
was asking specifically about devices that want to preserve the memory
because they know DMA will be ongoing during the reboot.  This became a
more general question: what devices should we support for KHO and what
should we not (what is considered too legacy?).  In the meantime, Pratyush
suggested explicit checks for this.
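
As a rough illustration of the explicit check discussed above, here is a
minimal sketch in Python (not kernel code).  The function names, the
representation of scratch regions as (start, size) pairs, and the choice of
-EBUSY as the returned error are all assumptions made for the example; the
call did not settle on a specific error code or API.

```python
import errno


def ranges_overlap(a_start, a_size, b_start, b_size):
    """Half-open ranges [start, start + size) overlap when each one
    starts before the other one ends."""
    return a_start < b_start + b_size and b_start < a_start + a_size


def try_preserve(scratch_regions, start, size):
    """Return 0 if [start, start + size) may be preserved, or a negative
    errno if the request is invalid or overlaps a scratch region.

    scratch_regions is a list of (start, size) pairs (hypothetical layout).
    """
    if size <= 0:
        return -errno.EINVAL
    for s_start, s_size in scratch_regions:
        if ranges_overlap(start, size, s_start, s_size):
            # Assumed error code; the notes only say "an error code".
            return -errno.EBUSY
    return 0
```

A driver-side caller in this model would check the return value and fall
back (or refuse to participate in the live update) when the reservation is
denied, rather than discovering the conflict after kexec.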

----->o-----
We shifted to talking about Pratyush's patch series supporting fdbox for
memfd[2].  Reaction to this was mixed: some feedback focused on the use
of miscdevice and there were security concerns.  Pratyush noted that
there was no intent to propose this as a generic concept outside KHO.

Pratyush noted there was no way to preserve folio orders in KHO and he
also noted there was a need for page flags.  He also said it would be
possible to move away from miscdevice and perhaps toward VFS but would
need to look more into this.

Pasha asked how the page flags were preserved.  Pratyush said that
another property currently stores them.

Pasha asked how cgroups would be handled, but there was no current
support for that.  Pratyush said the current RFC focused on anon memfd
and has not yet looked at hugetlb.  Pasha emphasized the importance of
focusing on one type of memory to start.

Pratyush noted in chat: "With FDBox work, I also realized that you can't
use FDT code from modules. Should not really be a problem since we can
export those symbols I suppose, but it doesn't work _currently_ at
least".

----->o-----
We discussed the KSTATE patch series[3] that Andrey had recently sent;
now in v2, it is closer to being a formal submission than an RFC.  He
noted his concern with KHO was how hard it was to write serialization
code.  His goal was to give drivers the ability to migrate structs across
kexec, which could be more elegant (see the struct kstate_description).
He suggested this would be more maintainable.  The approach had previously
been used for live migration in qemu.

Andrey noted that each description would have a version field that
enables defining the minimal supported version for each driver.  He made
the connection between this and version control in qemu.  Pasha asked how
this solves the problem when memory becomes sufficiently fragmented and
the next kernel cannot boot due to it; Andrey noted the kexec would fail.
Andrey suggested allocating a big contiguous area; the source and
destination ranges would be the same.

Mike noted that kstate_description definitions and the way drivers
declare their state to preserve are independent from scratch memory
reservation.  Andrey noted this wasn't a replacement for KHO but rather
could be built on top of KHO.

Mike suggested a layering: FDT on top of KHO, then the dynamic tree that
Pasha is proposing on top of that, then perhaps kstate on top of that.  He
would need to look more into kstate.

----->o-----
Mike asked if kstate descriptions depend on how the state is preserved on
the backend; an earlier version had a migration stream.  Andrey suggested
using FDT underneath, but there is no strong dependency.

Pasha asked what architectures were supported today for kstate; Andrey
said x86.  Pasha suggested that anything that lands upstream should
likely support both x86 and ARM.

Chris Li asked about kstate descriptions and what happens if a struct adds
or removes a member.  Andrey said if you want to add a new member, then
you can bump the version number.  He showed an example from qemu[4] that
could be used as reference for this.  You could also add a new kstate
description with a new id; on downgrade it simply wouldn't be used, for
backward compatibility.
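
To make the versioning idea concrete, here is a minimal sketch in Python of
a versioned state description.  It is only a model of the concept discussed
on the call: the class names, the "since" and "default" attributes, and the
blob layout are hypothetical and are not the kstate or qemu API.  The sketch
assumes a description carries a current version and a minimal supported
version, and that each field records the version that introduced it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    since: int = 1    # description version that introduced this field
    default: int = 0  # value used when the saved state predates the field


@dataclass(frozen=True)
class Description:
    ident: str
    version: int      # version written by this kernel
    min_version: int  # oldest saved version this kernel accepts
    fields: tuple


def save(desc, state):
    """Serialize state (a dict) along with the description id and version."""
    return {
        "id": desc.ident,
        "version": desc.version,
        "data": {f.name: state[f.name] for f in desc.fields},
    }


def restore(desc, blob):
    """Rebuild state from blob; raise ValueError on incompatible versions."""
    if blob["id"] != desc.ident:
        raise ValueError("description id mismatch")
    saved = blob["version"]
    if saved < desc.min_version or saved > desc.version:
        raise ValueError(f"unsupported version {saved}")
    # Fields newer than the saved version fall back to their defaults.
    return {
        f.name: (blob["data"][f.name] if f.since <= saved else f.default)
        for f in desc.fields
    }
```

In this model, adding a member means appending a Field with a higher "since"
and bumping "version"; a newer kernel can still restore older state as long
as the saved version is at least "min_version".  Rejecting state that is
newer than the running kernel understands is a choice made for this sketch;
the separate-description-with-a-new-id approach mentioned above would
instead let an older kernel ignore the unknown description.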

Alexander suggested starting with FDT logic because it already exists and
then serialize and de-serialize binary data using a UAPI.  Then, we
should discuss deprecating FDT if/when we have something better.  That
won't be problematic unless we gain hundreds of users.  He emphasized we
should focus on how to easily and quickly preserve memory across kexec,
calling back to drivers to store their state at the right time, etc.  The
data format for how to serialize is a tiny detail in comparison.  Pasha
fully agreed with this.

----->o-----
The next scheduled meeting is PREEMPTED by LSF/MM/BPF 2025 in Montreal, so
the next meeting will be on Monday, April 7 at 8am PDT (UTC-7).  I'll send
a reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - debrief discussions at LSF/MM/BPF 2025
 - update on Mike's patch series for memory reservation
 - update on Pratyush's progress for allocating swiotlb in low memory
   regions and any additional support required based on device
   requirements (who needs this scratch support?)
 - discuss whether the fdbox support would obsolete the need for
   guestmemfs in the long term
 - alignment on memblock as the first use case for KHO to justify
   upstreaming, including ftrace use cases
 - discuss Live Update Orchestrator (LUO) based on RFC patches sent by
   Pasha before then that helps to define the state machine
 - discuss how KSTATE plays into KHO upstreaming and complementary or
   overlapping goals
 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds
   can be added to an fdbox
 - iommufd patch series (as well as qemu) from James
 - establishing an API for callbacks into drivers to serialize state
   during brownout
 - topics proposed by Pasha: reducing blackout window, relaxed
   serialization, and KHO activation requirements
 - implications of preserving vIOMMU state
 - testing methodology for these components, including selftests

Please let me know if you'd like to propose additional topics for
discussion, thank you!

[1] https://lore.kernel.org/all/mafs0cyf4ii4k.fsf@xxxxxxxxxx
[2] https://lore.kernel.org/lkml/20250307005830.65293-5-ptyadav@xxxxxxxxx/T/
[3] https://lore.kernel.org/linux-mm/20250310120318.2124-6-arbn@xxxxxxxxxxxxxxx/T/
[4] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate



