Re: [LSF/MM/BPF TOPIC] memory persistence over kexec

Alexander Graf <graf@xxxxxxxxxx> · Sun, 26 Jan 2025 16:21:05 -0800

On 26.01.25 12:41, Pasha Tatashin wrote:
On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:

One way to solve that is pre-reserving space for the KHO tree -
ideally a reasonable amount, perhaps 32-64 MB and allocating it at
kexec load time.
Why is there any weird limit?
Setting a limit for KHO trees is similar to the limit we set for the
scratch area; we can overrun both. It is just one simple way to ensure
serialization is possible after kexec load, but there are obviously
other ways to solve this problem.

The problem is not only with allocation. Kexec has two loading schemes: 
user space driven loading and kernel based file loading. In the latter, 
we can do whatever we like. In the former, the flow expects that user 
space has ultimate control over the placement of the future data blobs 
and their contents.

I like the flexibility this allows for. It means that user space can 
inject its own KHO data for example if it wants to. Or modify it. It 
will come in very handy for debugging and testing later.
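
As a rough illustration of what "user space has ultimate control" means in the user space driven path: each segment is described by a source buffer plus a destination address that user space picks. The sketch below is loosely modeled on the (buf, bufsz, mem, memsz) tuple used by kexec_load(2); `struct blob_segment` and `make_blob_segment()` are hypothetical names for illustration, not an existing KHO or kexec interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical descriptor, loosely modeled on the (buf, bufsz, mem, memsz)
 * tuple that user space passes per segment to kexec_load(2).
 */
struct blob_segment {
	const void *buf;   /* source buffer in user space */
	size_t bufsz;      /* bytes valid in buf */
	uint64_t mem;      /* destination physical address, chosen by user space */
	size_t memsz;      /* bytes reserved at mem, rounded up to whole pages */
};

static struct blob_segment make_blob_segment(const void *buf, size_t bufsz,
					     uint64_t dest, size_t page_size)
{
	struct blob_segment seg;

	/* page_size must be a power of two and dest must be page aligned. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	assert((dest & (page_size - 1)) == 0);

	seg.buf = buf;
	seg.bufsz = bufsz;
	seg.mem = dest;
	/* Round the reserved size up to the next page boundary. */
	seg.memsz = (bufsz + page_size - 1) & ~(page_size - 1);
	return seg;
}
```

Because both the contents (buf) and the placement (mem) come from user space here, a user space tool could in principle hand over a modified or entirely synthetic blob, which is the debugging and testing flexibility mentioned above.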

We are preserving hundreds of GB of pages
backing the VM and more. There is endless memory being preserved across?
There are other ways to do that, but even with this limit, I do not
see this as an issue. The gigabytes of pages backing VMs would not be
scattered as individual 4K pages; that's simply inefficient. The
number of physical ranges is going to be small. If the preserved data
is so large that it cannot fit into a reasonably sized tree, then I
claim that the data should not be saved directly in the tree. Instead,
it should have its own metadata that is pointed to from the tree.
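
To make the range argument concrete, here is a minimal sketch of how a sorted list of preserved page frame numbers collapses into a handful of contiguous ranges. `struct phys_range` and `coalesce_pfns()` are hypothetical names for illustration only; this is not how KHO serializes memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one preserved physical range (not a KHO type). */
struct phys_range {
	uint64_t start_pfn;   /* first page frame number in the range */
	uint64_t nr_pages;    /* number of contiguous pages */
};

/*
 * Coalesce a sorted, duplicate-free array of page frame numbers into
 * contiguous ranges. Returns the number of ranges written to out.
 */
static size_t coalesce_pfns(const uint64_t *pfns, size_t n,
			    struct phys_range *out, size_t max_out)
{
	size_t nr = 0;

	for (size_t i = 0; i < n; i++) {
		/* Input must be strictly increasing. */
		assert(i == 0 || pfns[i] > pfns[i - 1]);

		if (nr > 0 &&
		    out[nr - 1].start_pfn + out[nr - 1].nr_pages == pfns[i]) {
			out[nr - 1].nr_pages++;   /* extends the previous range */
			continue;
		}

		assert(nr < max_out);             /* caller sized out too small */
		out[nr].start_pfn = pfns[i];
		out[nr].nr_pages = 1;
		nr++;
	}
	return nr;
}
```

The number of descriptors grows with the number of discontiguous ranges, not with the number of pages, so memory backed by large contiguous allocations needs very little space in the tree or in metadata the tree points to.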

Correct :). The way I think of the KHO DT is as a uniform way to 
implement setup_data across kexec that is identical across all 
architectures, enforces review and structure to ensure we keep 
compatibility and generalizes memory reservation.

The alternatives we have today are hacks like IMA: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/uapi/asm/setup_data.h#n73

Alternatively, we could allow allocating the FDT tree at kernel shutdown
time. At that time there should be plenty of free memory as we already
finished with userland. However, we have to be careful to allocate
from memory that does not overlap the area where kernel segments and
initramfs are going to be relocated.

Yes, this is easier said than done. In the user space driven kexec path, 
user space is in control of memory locations. At least after the first 
kexec iteration, these locations will overlap with the existing Linux 
runtime environment, because both lie in the scratch region. Only the 
purgatory moves everything to where it should be.
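
The overlap constraint discussed above can be sketched as a simple search: pick the lowest aligned address inside a window that does not collide with any range already claimed as a relocation target. `struct mem_range` and `find_free_range()` are hypothetical helpers for illustration, and the sketch assumes the full list of destination ranges is known at allocation time, which is exactly the part that is hard in the user space driven path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open physical range [start, end); not a kernel type. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
	return a.start < b.end && b.start < a.end;
}

/*
 * Find the lowest align-aligned address in window where size bytes fit
 * without overlapping any reserved range. Address overflow is not handled.
 */
static bool find_free_range(struct mem_range window, uint64_t size,
			    uint64_t align, const struct mem_range *reserved,
			    size_t nr, uint64_t *out)
{
	uint64_t cand;

	assert(size != 0);
	assert(align != 0 && (align & (align - 1)) == 0);

	cand = (window.start + align - 1) & ~(align - 1);
	for (;;) {
		struct mem_range r;
		bool moved = false;

		if (cand > window.end || window.end - cand < size)
			return false;             /* nothing left in the window */

		r.start = cand;
		r.end = cand + size;
		for (size_t i = 0; i < nr; i++) {
			if (ranges_overlap(r, reserved[i])) {
				/* Skip past the colliding range and retry. */
				cand = (reserved[i].end + align - 1) & ~(align - 1);
				moved = true;
				break;
			}
		}
		if (!moved) {
			*out = cand;
			return true;
		}
	}
}
```

The search always terminates because every retry moves the candidate strictly upward, past the end of the range it collided with.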

Maybe we could create a special kexec memory type that means "KHO DT"?

Alex