Hey Oleksij!
On 06.02.24 09:17, Oleksij Rempel wrote:
Hi Alexander,
Nice work!
On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.
However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:
https://lpc.events/event/17/contributions/1485/
To start us on the journey to support all the use cases above, this
patch implements basic infrastructure to allow hand over of kernel state
across kexec (Kexec HandOver, aka KHO). As an example target, we use ftrace:
With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full blown VFIO state preservation.
== Alternatives ==
There are alternative approaches to (parts of) the problems above:
* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line
All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.
KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO, with
KHO providing the foundational primitive to pass metadata and bulk memory
reservations as well as provide easy versioning for data.
== Overview ==
We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not have any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch region and sets all handed over memory as in use.
When drivers that support KHO initialize, they introspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
== Limitations ==
I currently only implemented file based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet inside kexec tools.
== How to Use ==
To use the code, please boot the kernel with the "kho_scratch=" command
line parameter set: "kho_scratch=512M". KHO requires a scratch region.
Make sure to fill ftrace with contents that you want to observe after
kexec. Then, before you invoke file based "kexec -l", activate KHO:
# echo 1 > /sys/kernel/kho/active
# kexec -l Image --initrd=initrd -s
# kexec -e
The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.
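
The steps above could be wrapped in a small script that refuses to
continue when no scratch region was configured at boot. This is only a
sketch: the function names kho_scratch_from_cmdline and kho_kexec are
made up for illustration and are not part of the patch set. The sysfs
path and the kexec invocation are the ones shown above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the patch set.

# Print the value of kho_scratch= found in a kernel command line string.
# Exits non-zero if the parameter is absent. The body runs in a subshell
# so that disabling globbing (set -f) does not leak to the caller.
kho_scratch_from_cmdline() (
    set -f
    for arg in $1; do
        case "$arg" in
            kho_scratch=*)
                printf '%s\n' "${arg#kho_scratch=}"
                exit 0
                ;;
        esac
    done
    exit 1
)

# Activate KHO, load the new kernel via file based kexec and jump to it.
# $1: kernel image, $2: initrd
kho_kexec() {
    if ! kho_scratch_from_cmdline "$(cat /proc/cmdline)" >/dev/null; then
        echo "kho_scratch= missing from kernel command line" >&2
        return 1
    fi
    echo 1 > /sys/kernel/kho/active || return 1
    kexec -l "$1" --initrd="$2" -s || return 1
    kexec -e
}
```

With these definitions loaded, "kho_kexec Image initrd" would run the
same three commands as above, but only if kho_scratch= was set.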
Assuming:
- we want to start tracing as early as possible, before rootfs
or initrd would be able to configure it.
- traces are stored on a different device, not RAM. For example NVMEM.
- Location of NVMEM is different for different board types, but
bootloader is able to give the right configuration to the kernel.
Let me try to really understand what you're tracing here. Are we talking
about exposing boot loader traces into Linux [1]? In that case, I think
a mechanism like [2] is what you're looking for.
Or do you want to transfer genuine Linux ftrace traces? In that case,
why would you want to store them outside of RAM?
What would be the best, acceptable for mainline, way to provide this
kind of configuration? At least part of this information does not
describe devices or device states, so it would not fit into the devicetree
universe. The amount of possible information would not fit into bootconfig
either.
We have precedent for configuration in device tree: You can use device
tree to describe partitions on a NAND device, you can use it to specify
MAC address overrides of devices attached to USB, etc etc. At the end of
the day when people say they don't want configuration in device tree,
what they mean is that device tree should be a hand over data structure
from firmware to kernel, not from OS integrator to kernel :). If your
firmware is the place that knows about offsets and you need to pass
those offsets, IMHO DT is a good fit.
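
As a rough illustration of consuming such firmware provided values, user
space can read properties from the live tree, which Linux typically
exposes under /proc/device-tree. The helper name dt_read_string and any
node or property names passed to it are hypothetical; this is a sketch,
not a proposal for a binding.

```shell
#!/bin/sh
# Sketch only: helper name, node paths and property names are hypothetical.

# Print a string property from a device tree exposed as a directory tree.
# $1: base directory (on a running system typically /proc/device-tree)
# $2: node path relative to the base
# $3: property name
dt_read_string() {
    prop="$1/$2/$3"
    [ -r "$prop" ] || return 1
    # String properties are NUL-terminated; strip the terminator.
    tr -d '\000' < "$prop"
    echo
}
```

For example, "dt_read_string /proc/device-tree chosen bootargs" prints
the boot arguments on systems where firmware filled in /chosen.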
Another more or less overlapping use case I have in mind is a netbootable
embedded system with a requirement to boot as fast as possible. Since the
bootloader has already established a link and obtained all needed IP
configuration, it would be able to hand over the Ethernet controller and IP
configuration states. Would KHO be the way to go for this use case?
That's an interesting one too. I would lean towards "try with normal
device tree first" here as well. It's again a very clear case of
"firmware wants to tell OS about things it knows, but the OS doesn't
know" to me. That means device tree should be fine to describe it.
Alex
[1] https://www.youtube.com/watch?v=RaFm5FfzFaM /
https://edk2.groups.io/g/devel/topic/91368904
[2] https://github.com/agraf/linux/commit/b1fe0c296ec923e9b1f544862b0eb9365a8da7cb
Regards,
Oleksij
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879