Re: [LSF/MM/BPF TOPIC] memory persistence over kexec

Jason Gunthorpe <jgg@xxxxxxxx> · Fri, 24 Jan 2025 10:56:31 -0400

On Fri, Jan 24, 2025 at 01:30:52PM +0200, Mike Rapoport wrote:
> Hi Jason,
> 
> On Mon, Jan 20, 2025 at 10:14:27AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > Hi,
> > > 
> > > I'd like to discuss memory persistence across kexec.
> > > 
> > > Currently there is ongoing work on Kexec HandOver (KHO) [1] that allows
> > > serialization and deserialization of kernel data as well as preserving
> > > arbitrary memory ranges across kexec.
> > > 
> > > In addition, KHO keeps physically contiguous memory regions that are
> > > guaranteed to not have any memory that KHO would preserve, but still can be
> > > used by the system. The kexeced kernel bootstraps itself using those
> > > regions and sets all handed over memory as in use. KHO users then can
> > > recover their state from the preserved data. This includes memory
> > > reservations, where the user can either discard or claim reservations.
> > > 
> > > KHO can be used as the base layer for implementation of persistence-aware
> > > memory allocator and persistent in-memory filesystem.
> > > 
> > > Aside from status update on KHO progress there are a few topics that I would
> > > like to discuss:
> > > * Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?
> > > * Or is it better to implement yet another in-memory filesystem dedicated
> > >   for persistence?
> > > * What is the best way to ensure that the memory we want to persist is not
> > >   scattered all over the place?
> > 
> > There is a lot of talk about taking *drivers* and having them survive
> > kexec, meaning the driver has to put a lot of its state into KHO and
> > then get it back out again.
> > 
> > I've been hoping for a model where a driver can be told to "go to KHO"
> > and the KHO code can be largely contained in the driver and regulated
> > relegated to recording the driver state. This implies the state may be
> > fragmented all over memory.
> 
> I'm not sure I follow what you mean by "go to KHO" here.

Drawing on our now extensive experience with PCI device live
migration, I imagine a state progression approximately like:

RUNNING - minimal or no KHO involvement
PREPARE - KHO stuff starts to get ready, preallocations, loading
          successor kernels, etc. No VM degradation
PRE-STOP - KHO gets serious, stuff starts to become unavailable,
           userspace needs to shut things down and get ready. Some
           level of VM degradation - i.e. changing IOMMU translations
           may block the VM until CONCLUDE.
STOP - Now you've done it. KHO state is finalized - VMs stop running
KEXEC - Weee - VMs not running
RESUME - Get booted up, get ready to start up the VMs - VM still stopped
POST-RESUME - Start unpacking more stuff from KHO, userspace starts
              bringing back other stuff it may have shut down. Some
              level of VM degradation
CONCLUDE - Discard all the remaining KHO stuff. No VM degradation
RUNNING - minimal or no KHO involvement

Each of these states should inform drivers/etc when we reach them, and
the KHO state that will survive the kexec evolves and extends as the
sequence progresses.

So "go to KHO" would refer to a driver that is using PREPARE and
PRE-STOP to start moving its functionality from normal memory to KHO
preserved memory, possibly with some functional degradation.
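
To make the shape of that concrete, here is a rough userspace sketch of
the progression with per-phase callbacks. Every name in it (kho_phase,
kho_participant, kho_advance, toy_drv) is made up for illustration and
is not an existing kernel API. Letting a callback refuse a transition
is also my assumption, and a real design would have to unwind the
participants that were already notified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical phase names mirroring the list above; not a kernel API. */
enum kho_phase {
	KHO_RUNNING,
	KHO_PREPARE,
	KHO_PRE_STOP,
	KHO_STOP,
	KHO_KEXEC,
	KHO_RESUME,
	KHO_POST_RESUME,
	KHO_CONCLUDE,
	KHO_NR_PHASES,
};

/* A driver (or iommu, etc) that wants to be told about each phase. */
struct kho_participant {
	const char *name;
	void *priv;
	/* Return 0 on success, negative to refuse the transition. */
	int (*phase_change)(struct kho_participant *p, enum kho_phase next);
	struct kho_participant *next;
};

struct kho_machine {
	enum kho_phase phase;
	struct kho_participant *participants;
};

static void kho_register(struct kho_machine *m, struct kho_participant *p)
{
	p->next = m->participants;
	m->participants = p;
}

/* Move to the following phase; CONCLUDE wraps back around to RUNNING. */
static int kho_advance(struct kho_machine *m)
{
	enum kho_phase next;
	struct kho_participant *p;
	int ret;

	assert(m->phase < KHO_NR_PHASES);
	next = (m->phase == KHO_CONCLUDE) ? KHO_RUNNING : m->phase + 1;

	for (p = m->participants; p; p = p->next) {
		ret = p->phase_change(p, next);
		if (ret)
			return ret;	/* stay in the current phase */
	}
	m->phase = next;
	return 0;
}

/* Toy driver that "goes to KHO" over PREPARE and PRE-STOP. */
struct toy_drv {
	int prealloc_done;
	int state_in_kho;
};

static int toy_drv_phase_change(struct kho_participant *p, enum kho_phase next)
{
	struct toy_drv *d = p->priv;

	switch (next) {
	case KHO_PREPARE:
		d->prealloc_done = 1;	/* preallocate, nothing degraded */
		break;
	case KHO_PRE_STOP:
		if (!d->prealloc_done)
			return -1;
		d->state_in_kho = 1;	/* state now lives in KHO memory */
		break;
	case KHO_CONCLUDE:
		d->state_in_kho = 0;	/* back on normal memory */
		d->prealloc_done = 0;
		break;
	default:
		break;
	}
	return 0;
}
```

The point is only that the driver's KHO logic stays in one callback
keyed off the phase, rather than being spread over its normal paths.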

> I believe that ftrace example in Alex's v3 of KHO
> (https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx)
> has enough meat to demonstrate the basic model.

ftrace is just too simple to capture the full complexity of what a
real HW device would need. We've now spent time thinking about what it
would take to make a complex NIC survive kexec and I suggest the above
model for how to approach it.

> > The other direction is that the driver has to start up in some special
> > KHO mode and KHO becomes invasive on all driver paths to use special
> > KHO allocations. This seems like a PITA.
> > 
> > You can see this difference just in the discussion around the iommu
> > serialization where one idea was to have KHO be an integral (and
> > invasive!) part of the page table operations from time zero vs some
> > later serialization at kexec time.
> 
> I didn't follow that discussion closely, but there still should be a step
> when iommu driver would try to deserialize the data and use it if
> deserialization succeeds.

There were two options: one is that the iommu always lives in KHO, the
other is that the iommu moves into KHO (i.e. "go to KHO").

For instance, assuming the latter, as you progress through the above
state list:

RUNNING - IOMMU page tables are in normal memory and normal IOMMU code
          is used to manipulate them
PREPARE - We allocate an approximate amount of KHO memory needed to hold
          the page tables
PRE-STOP - The page tables are copied into the KHO memory and frozen
           to be unchanging
STOP - The IOMMU driver records to KHO which devices have KHO page
       tables
RESUME - The IOMMU driver recovers the KHO page tables and hitlessly
         sets up the new HW lookup tables to use them
POST-RESUME - The page tables are copied out of the KHO memory and
              back to normal memory where normal IOMMU algorithms can
              operate on them
CONCLUDE - All the KHO memory is freed
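
A toy userspace model of that copy-in/copy-out flow, just to show how
little of the normal map path has to know about it. All names are
hypothetical and "KHO memory" is simply a second heap buffer. I am
assuming the HW is pointed at the KHO copy already at PRE-STOP so that
nothing needs to change across the kexec itself. STOP and RESUME are
left out because here the struct itself plays the role of the record
that would be handed over.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: not the real IOMMU page table code. */
struct toy_pgtable {
	uint64_t *normal;	 /* table in ordinary memory, NULL after kexec */
	uint64_t *kho;		 /* copy in pretend KHO-preserved memory */
	size_t nr_entries;
	int frozen;		 /* set from PRE-STOP until POST-RESUME */
	const uint64_t *hw_root; /* the table the pretend HW walks */
};

static int toy_init(struct toy_pgtable *pt, size_t nr_entries)
{
	memset(pt, 0, sizeof(*pt));
	pt->normal = calloc(nr_entries, sizeof(*pt->normal));
	if (!pt->normal)
		return -1;
	pt->nr_entries = nr_entries;
	pt->hw_root = pt->normal;
	return 0;
}

/* RUNNING: the normal map path; refused while the table is frozen. */
static int toy_map(struct toy_pgtable *pt, size_t idx, uint64_t pte)
{
	if (pt->frozen || !pt->normal || idx >= pt->nr_entries)
		return -1;
	pt->normal[idx] = pte;
	return 0;
}

/* PREPARE: preallocate the KHO memory, nothing else changes yet. */
static int toy_prepare(struct toy_pgtable *pt)
{
	pt->kho = calloc(pt->nr_entries, sizeof(*pt->kho));
	return pt->kho ? 0 : -1;
}

/* PRE-STOP: copy into KHO memory, freeze, point HW at the copy. */
static int toy_pre_stop(struct toy_pgtable *pt)
{
	if (!pt->kho)
		return -1;
	memcpy(pt->kho, pt->normal, pt->nr_entries * sizeof(*pt->kho));
	pt->frozen = 1;
	pt->hw_root = pt->kho;
	return 0;
}

/* KEXEC: ordinary memory is lost, only the KHO copy survives. */
static void toy_kexec(struct toy_pgtable *pt)
{
	free(pt->normal);
	pt->normal = NULL;
}

/* POST-RESUME: copy back out to normal memory and unfreeze. */
static int toy_post_resume(struct toy_pgtable *pt)
{
	if (!pt->kho)
		return -1;
	pt->normal = malloc(pt->nr_entries * sizeof(*pt->normal));
	if (!pt->normal)
		return -1;
	memcpy(pt->normal, pt->kho, pt->nr_entries * sizeof(*pt->normal));
	pt->hw_root = pt->normal;
	pt->frozen = 0;
	return 0;
}

/* CONCLUDE: the KHO memory is no longer needed. */
static void toy_conclude(struct toy_pgtable *pt)
{
	free(pt->kho);
	pt->kho = NULL;
}
```

Note toy_map() only grew a single "frozen" check; everything else that
is KHO specific sits in the phase functions.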

With the first option, by comparison, we'd somehow teach the IOMMU code
to always use KHO for allocations, and KHO would somehow have to be
compatible with, and preserve, the IOMMU's use of struct page metadata.
That avoids the serializing copy, but you have to make invasive KHO
changes to the existing IOMMU page table code.

Versus serialization, which could be isolated to a KHO module that
doesn't bother anyone else.

[Also, I would prefer to see KHO updates to page table code after
consolidating the iommu page table code in one place. Could use some
help on that project too :)

https://patch.msgid.link/r/0-v1-01fa10580981+1d-iommu_pt_jgg@xxxxxxxxxx
]

> My understanding is that a major part of the complexity in iommu is the
> userspace facing bits that need to be somehow connected to the restored in
> kernel structures after kexec.

Yes certainly this is hard too. I have yet to see a complete
functional proposal for this.

I have been feeling that KHO should have a way to preserve a driver
file descriptor. Not a full descriptor, but something stripped back
and simplified. Getting a descriptor through KHO, vs /dev/XXX would
trigger special stuff like not FLRing VFIO PCI devices, not wrecking
the IOMMU translation and so on.

For instance, for iommufd we may move the tables into KHO, destroy all
other iommufd objects, then transfer the stripped down iommufd FD to
KHO. On resume the VMM would recover the KHO iommufd FD and rebuild
the lost objects, then destroy the special KHO page table.
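
Purely as bookkeeping, with made-up names and none of the real iommufd
object model, that sequence could be sketched like this:

```c
#include <stdbool.h>

/* Toy stand-in for an iommufd context; fields are illustrative only. */
struct toy_iommufd {
	int nr_objects;		/* everything except the page tables */
	bool tables_in_kho;	/* page tables moved into KHO memory */
	bool handed_to_kho;	/* stripped down FD transferred to KHO */
};

/* Before kexec: move the tables, drop everything else, hand over the FD. */
static int toy_iommufd_to_kho(struct toy_iommufd *ictx)
{
	if (ictx->handed_to_kho)
		return -1;
	ictx->tables_in_kho = true;
	ictx->nr_objects = 0;
	ictx->handed_to_kho = true;
	return 0;
}

/* After kexec: the VMM recovered the FD from KHO and rebuilds the rest. */
static int toy_iommufd_rebuild(struct toy_iommufd *ictx, int nr_objects)
{
	if (!ictx->handed_to_kho || !ictx->tables_in_kho)
		return -1;
	ictx->nr_objects = nr_objects;
	ictx->tables_in_kho = false;	/* special KHO page table destroyed */
	ictx->handed_to_kho = false;
	return 0;
}
```

The rebuild step failing when the FD did not come through KHO is an
assumption on my part; what should happen in that case is exactly the
open question.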

The really tricky thing is there is *a lot* of state in these FDs; some
of it we can imagine retaining, the rest will have to be rebuilt.

There are also a lot of kernel actions that don't happen at FD open
time. Some kind of philosophy is needed here - what happens if the
kernel skips steps to preserve KHO state, but the userspace doesn't
follow the KHO flow? I.e. userspace opens /dev/vfio instead of the KHO
version? The /dev/vfio is pretty wrecked because of what KHO did. Does
the kernel have to fix it? Should the kernel forbid it? What happens if
we KHO and then KHO again without userspace fixing everything? So many
questions :\

Jason