On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@xxxxxxxxxx>
>
> Hi,
>
> This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
> series (https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx).
> To make things simpler, we decided to preserve "reserve_mem" regions
> instead of ftrace buffers.
>
> The patches are also available in git:
> https://git.kernel.org/rppt/h/kho/v4
>
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See the "pkernfs: Persisting guest memory
> and kernel/device state safely across kexec" Linux Plumbers
> Conference 2023 presentation for details:
>
> https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this patch
> implements basic infrastructure to allow hand over of kernel state across
> kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> memblock's reserve_mem.
> With this patch set applied, memory that was reserved using "reserve_mem"
> command line options remains intact after kexec and it is guaranteed to
> reside at the same physical address.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
> * Memory Pools [1] - preallocated persistent memory region + allocator
> * PRMEM [2] - resizable persistent memory regions with fixed metadata
>   pointer on the kernel command line + allocator
> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>   address location on the kernel command line
> * PKRAM [4] - handover of user space pages using a fixed metadata page
>   specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of,
> for example, IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep "scratch regions" available
> for kexec: physically contiguous memory regions that are guaranteed to
> not have any memory that KHO would preserve. The new kernel bootstraps
> itself using the scratch regions and sets all handed over memory as in use.
> When drivers that support KHO initialize, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> Currently KHO is only implemented for file based kexec. The kernel
> interfaces in the patch set are already in place to support user space
> kexec as well, but it is not yet implemented in kexec tools.
>

On which architectures exactly does KHO work? Device Tree should be OK on
arm*, x86 and power*, but what about s390?

Thanks
Dae