Re: [LSF/MM/BPF TOPIC] memory persistence over kexec

Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> · Sun, 26 Jan 2025 15:41:11 -0500

On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
>
> > One way to solve that is pre-reserving space for the KHO tree -
> > ideally a reasonable amount, perhaps 32-64 MB and allocating it at
> > kexec load time.
>
> Why is there any weird limit?

Setting a limit for KHO trees is similar to the limit we set for the
scratch area; we can overrun both. It is just one simple way to ensure
serialization is possible after kexec load, but there are obviously
other ways to solve this problem.

> We are preserving hundreds of GB of pages
> backing the VM and more. There is endless memory being preserved across?

There are other ways to do that, but even with this limit, I do not
see this as an issue. The gigabytes of pages backing VMs would not be
scattered as individual 4K pages; that's simply inefficient. The
number of physical ranges is going to be small. If the preserved data
is so large that it cannot fit into a reasonably sized tree, then I
claim that the data should not be saved directly in the tree. Instead,
it should have its own metadata that is pointed to from the tree.
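
As a rough illustration of the range argument above, here is a minimal
userspace sketch (not KHO code) that coalesces a sorted list of page
frame numbers into contiguous ranges. The struct phys_range type and
the coalesce_pfns() helper are hypothetical names invented for this
example; the point is only that the number of descriptors tracks the
number of contiguous runs, not the number of 4K pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one physically contiguous preserved range. */
struct phys_range {
	uint64_t start_pfn;	/* first page frame number of the range */
	uint64_t nr_pages;	/* number of 4K pages in the range */
};

/*
 * Coalesce a sorted array of page frame numbers into contiguous ranges.
 * Returns the number of ranges written to out, or (size_t)-1 if more
 * than max_out ranges would be needed.
 */
size_t coalesce_pfns(const uint64_t *pfns, size_t n,
		     struct phys_range *out, size_t max_out)
{
	size_t nr = 0;

	assert(pfns != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* Extend the current range if this pfn directly follows it. */
		if (nr > 0 &&
		    out[nr - 1].start_pfn + out[nr - 1].nr_pages == pfns[i]) {
			out[nr - 1].nr_pages++;
			continue;
		}
		/* Otherwise start a new range, if there is room for one. */
		if (nr == max_out)
			return (size_t)-1;
		out[nr].start_pfn = pfns[i];
		out[nr].nr_pages = 1;
		nr++;
	}
	return nr;
}
```

With this layout, a 1 GB physically contiguous region costs one
16-byte descriptor, while the same memory described page by page would
need 262144 entries.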

Alternatively, we could allow allocating the FDT tree at kernel
shutdown time. At that point there should be plenty of free memory, as
we are already finished with userland. However, we have to be careful
to allocate from memory that does not overlap the area where kernel
segments and initramfs are going to be relocated.
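
The overlap constraint can be sketched as a simple first-fit search
that skips reserved destination ranges. This is a userspace sketch
under stated assumptions, not the kexec allocator: struct mem_range
and find_free_base() are hypothetical names, the reserved list stands
in for the kernel and initramfs destination areas, and it is assumed
to be sorted by start address with no overlapping entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one reserved destination area. */
struct mem_range {
	uint64_t start;	/* physical start address */
	uint64_t size;	/* length in bytes */
};

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Find the lowest aligned base in [lo, hi) where size bytes fit without
 * touching any reserved range. reserved[] must be sorted by start and
 * non-overlapping. Returns UINT64_MAX if no such base exists.
 */
uint64_t find_free_base(const struct mem_range *reserved, size_t nr,
			uint64_t lo, uint64_t hi,
			uint64_t size, uint64_t align)
{
	uint64_t cursor = align_up(lo, align);

	for (size_t i = 0; i < nr; i++) {
		uint64_t r_end = reserved[i].start + reserved[i].size;

		/* The gap before this reserved range is large enough. */
		if (cursor + size <= reserved[i].start)
			break;
		/* Otherwise move past the reserved range if it is ahead. */
		if (r_end > cursor)
			cursor = align_up(r_end, align);
	}

	/* The candidate must still end within the allowed window. */
	if (cursor + size > hi)
		return UINT64_MAX;
	return cursor;
}
```

A real implementation would also have to consider the scratch area and
any other memory already claimed for the next kernel; the sketch only
shows the overlap test itself.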

> So why are we trying to shoehorn a bunch of KHO stuff into the DT?
> Shouldn't the DT just have a small KHO info pointing to the real KHO
> memory in normal pages?

Yes, for entities like file systems, there absolutely should be a
small KHO info entry pointing to metadata pages that preserve the
normal pages. However, for devices that are kept alive, most of the
data should be saved directly in the tree, unless there is a large
sparse soft state that must be carried for some reason (e.g. network
flows or something similar).

> Even if you want to re-use DT as some kind of serializing scheme in
> drivers the DT framework can let each driver build its own tree,
> serialize it to its own memory and then just link a pointer to that
> tree.
>
> Also, I'm not sure forcing using DT as a serializing scheme is a great
> idea. It is complicated and doesn't do that much to solve the complex
> versioning problem drivers face here..

The primary goal of the KHO device tree is to standardize the
live-update metadata that drivers preserve to maintain device
functionality across reboots. We will document this using the YAML
binding format, similar to our current approach for cold boot and
getting the device tree from firmware. Otherwise, we could just use
other methods such as PKRAM, where there is no inherent
standardization involved, but which allow devices to be serialized
during absolutely any phase of reboot.

Pasha