On Fri, Jan 24, 2025 at 01:30:52PM +0200, Mike Rapoport wrote:
> Hi Jason,
>
> On Mon, Jan 20, 2025 at 10:14:27AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > Hi,
> > >
> > > I'd like to discuss memory persistence across kexec.
> > >
> > > Currently there is ongoing work on Kexec HandOver (KHO) [1] that allows
> > > serialization and deserialization of kernel data as well as preserving
> > > arbitrary memory ranges across kexec.
> > >
> > > In addition, KHO keeps a physically contiguous memory regions that are
> > > guaranteed to not have any memory that KHO would preserve, but still can be
> > > used by the system. The kexeced kernel bootstraps itself using those
> > > regions and sets all handed over memory as in use. KHO users then can
> > > recover their state from the preserved data. This includes memory
> > > reservations, where the user can either discard or claim reservations.
> > >
> > > KHO can be used as the base layer for implementation of persistence-aware
> > > memory allocator and persistent in-memory filesystem.
> > >
> > > Aside from status update on KHO progress there are a few topics that I would
> > > like to discuss:
> > > * Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?
> > > * Or is it better to implement yet another in-memory filesystem dedicated
> > >   for persistence?
> > > * What is the best way to ensure that the memory we want to persist is not
> > >   scattered all over the place?
> >
> > There is alot of talk about taking *drivers* and having them survive
> > kexec, meaning the driver has to put alot of its state into KHO and
> > then get it back out again.
> >
> > I've been hoping for a model where a driver can be told to "go to KHO"
> > and the KHO code can be largely contained in the driver and regulated
> > to recording the driver state. This implies the state may be
> > fragmented all over memory.
>
> I'm not sure I follow what do you mean by "go to KHO" here.

Drawing on our now extensive experience with PCI device live migration,
I imagine a state progression approximately like:

 RUNNING - minimal or no KHO involvement
 PREPARE - KHO stuff starts to get ready, preallocations, loading
           successor kernels, etc. No VM degradation
 PRE-STOP - KHO gets serious, stuff starts to become unavailable,
            userspace needs to shut things down and get ready. Some
            level of VM degradation - ie changing IOMMU translations
            may block the VM until CONCLUDE.
 STOP - Now you've done it. KHO state is finalized - VMs stop running
 KEXEC - Weee - VMs not running
 RESUME - Get booted up, get ready to start up the VMs - VM still
          stopped
 POST-RESUME - Start unpacking more stuff from KHO, userspace starts
               bringing back other stuff it may have shut down. Some
               level of VM degradation
 CONCLUDE - Discard all the remaining KHO stuff. No VM degradation
 RUNNING - minimal or no KHO involvement

Each of these states should inform drivers/etc when we reach them, and
the KHO state that will survive the kexec evolves and extends as it
progresses.

So "go to KHO" would refer to a driver that is using PREPARE and
PRE-STOP to start moving its functionality from normal memory to KHO
preserved memory, possibly with some functional degradation.

> I believe that ftrace example in Alex's v3 of KHO
> (https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx)
> has enough meat to demonstrate the basic model.

ftrace is just too simple to capture the full complexity of what a
real HW device would need. We've now spent time thinking about what it
would take to make a complex NIC survive kexec and I suggest the above
model for how to approach it.

> > The other direction is that the driver has to start up in some special
> > KHO mode and KHO becomes invasive on all driver paths to use special
> > KHO allocations. This seems like a PITA.
> >
> > You can see this difference just in the discussion around the iommu
> > serialization where one idea was to have KHO be an integral (and
> > invasive!) part of the page table operations from time zero vs some
> > later serialization at kexec time.
>
> I didn't follow that discussion closely, but there still should be a step
> when iommu driver would try to deserialize the data and use it if
> deserialization succeeds.

There were two options: one is that the iommu always lives in KHO, the
other is that the iommu moves into KHO (ie "go to KHO").

For instance, assuming the latter, as you progress through the above
state list:

 RUNNING - IOMMU page tables are in normal memory and normal IOMMU
           code is used to manipulate them
 PREPARE - We allocate an approximate amount of KHO memory needed to
           hold the page tables
 PRE-STOP - The page tables are copied into the KHO memory and frozen
            to be unchanging
 STOP - The IOMMU driver records to KHO which devices have KHO page
        tables
 RESUME - The IOMMU driver recovers the KHO page tables and hitlessly
          sets up the new HW lookup tables to use them
 POST-RESUME - The page tables are copied out of the KHO memory and
               back to normal memory where normal IOMMU algorithms can
               run them
 CONCLUDE - All the KHO memory is freed

Compare that to the first option, where we'd somehow teach the IOMMU
code to always use KHO for allocations, and KHO would somehow be
compatible with, and preserve, the IOMMU's use of struct page metadata.
That avoids the serializing copy, but you have to make invasive KHO
changes to the existing IOMMU page table code.

Versus serializing, which could be isolated to a KHO module that
doesn't bother anyone else.

[Also, I would prefer to see KHO updates to page table code after
consolidating the iommu page table code in one place.
Could use some help on that project too :)

 https://patch.msgid.link/r/0-v1-01fa10580981+1d-iommu_pt_jgg@xxxxxxxxxx
]

> My understanding it that a major part of the complexity in iommu is the
> userspace facing bits that need to be somehow connected to the restored in
> kernel structures after kexec.

Yes certainly this is hard too. I have yet to see a complete
functional proposal for this.

I have been feeling that KHO should have a way to preserve a driver
file descriptor. Not a full descriptor, but something stripped back
and simplified. Getting a descriptor through KHO, vs /dev/XXX, would
trigger special stuff like not FLRing VFIO PCI devices, not wrecking
the IOMMU translation and so on.

For instance for iommufd we may move the tables into KHO, destroy all
other iommufd objects, then transfer the stripped down iommufd FD to
KHO. On resume the VMM would recover the KHO iommufd FD and rebuild
the lost objects, then destroy the special KHO page table.

The really tricky thing is there is *a lot* of state in these FDs, some
we can imagine retaining, the rest will have to be rebuilt. There are
also a lot of kernel actions that don't happen at FD open time.

Some kind of philosophy is needed here - what happens if the kernel
skips steps to preserve KHO, but the userspace doesn't follow the KHO
flow? Ie userspace opens /dev/vfio instead of the KHO version? The
/dev/vfio is pretty wrecked because of what KHO did. Does the kernel
have to fix it? Should the kernel forbid it? What happens if we KHO and
then KHO again without userspace fixing everything?

So many questions :\

Jason
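
As a rough illustration of the state progression described above, here
is a minimal userspace sketch in C. It is not a proposed kernel API:
every identifier in it (kho_phase, kho_participant, kho_advance,
iommu_sketch and so on) is hypothetical and made up for this example.
It only models the ordering of the phases, the per-phase VM impact, and
how an IOMMU-like participant might react to each notification in the
"moves into KHO" option.

```c
#include <stddef.h>

/* All names below are hypothetical; nothing here exists in the kernel. */
enum kho_phase {
	KHO_RUNNING,
	KHO_PREPARE,
	KHO_PRE_STOP,
	KHO_STOP,
	KHO_KEXEC,
	KHO_RESUME,
	KHO_POST_RESUME,
	KHO_CONCLUDE,
	KHO_NR_PHASES,
};

enum vm_impact { VM_NORMAL, VM_DEGRADED, VM_STOPPED };

/* VM impact per phase, taken from the state list in the mail. */
static const enum vm_impact kho_vm_impact[KHO_NR_PHASES] = {
	[KHO_RUNNING]     = VM_NORMAL,
	[KHO_PREPARE]     = VM_NORMAL,
	[KHO_PRE_STOP]    = VM_DEGRADED,
	[KHO_STOP]        = VM_STOPPED,
	[KHO_KEXEC]       = VM_STOPPED,
	[KHO_RESUME]      = VM_STOPPED,
	[KHO_POST_RESUME] = VM_DEGRADED,
	[KHO_CONCLUDE]    = VM_NORMAL,
};

/* The progression is a cycle: CONCLUDE leads back to RUNNING. */
static enum kho_phase kho_next_phase(enum kho_phase p)
{
	return (enum kho_phase)((p + 1) % KHO_NR_PHASES);
}

/* Called with the phase being entered; returns 0 on success. */
typedef int (*kho_phase_fn)(enum kho_phase entering, void *priv);

struct kho_participant {
	const char *name;
	kho_phase_fn notify;
	void *priv;
};

/*
 * Notify every participant of the next phase, then advance. If a
 * participant fails, the current phase is not advanced. This sketch
 * does not roll back participants that were already notified.
 */
static int kho_advance(enum kho_phase *cur,
		       struct kho_participant *parts, size_t n)
{
	enum kho_phase next = kho_next_phase(*cur);
	size_t i;

	for (i = 0; i < n; i++) {
		int rc = parts[i].notify ? parts[i].notify(next, parts[i].priv) : 0;

		if (rc)
			return rc;
	}
	*cur = next;
	return 0;
}

/* Toy model of the "iommu moves into KHO" option. */
enum pt_location { PT_NORMAL_MEMORY, PT_KHO_MEMORY };

struct iommu_sketch {
	enum pt_location tables; /* where the page tables currently live */
	int kho_reserved;        /* KHO memory has been allocated */
	int frozen;              /* page tables may not change */
};

static int iommu_sketch_notify(enum kho_phase entering, void *priv)
{
	struct iommu_sketch *s = priv;

	switch (entering) {
	case KHO_PREPARE:     /* allocate approximate KHO memory */
		s->kho_reserved = 1;
		break;
	case KHO_PRE_STOP:    /* copy tables into KHO memory and freeze */
		s->tables = PT_KHO_MEMORY;
		s->frozen = 1;
		break;
	case KHO_POST_RESUME: /* copy tables back to normal memory */
		s->tables = PT_NORMAL_MEMORY;
		s->frozen = 0;
		break;
	case KHO_CONCLUDE:    /* free all the KHO memory */
		s->kho_reserved = 0;
		break;
	default:              /* STOP/RESUME bookkeeping is not modelled */
		break;
	}
	return 0;
}
```

A caller would register iommu_sketch_notify in an array of
kho_participant entries and call kho_advance() once per transition.
The serializing copy is contained entirely in the participant's
callback, which is the point of the "go to KHO" model: the normal page
table code is untouched while in RUNNING.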