Re: [LSF/MM/BPF TOPIC] memory persistence over kexec

On Sat, Jan 25, 2025 at 4:53 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> Hi Pasha,
>
> On Wed, Jan 22, 2025 at 06:30:22PM -0500, Pasha Tatashin wrote:
> > > > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > > > Hi,
> > > > >
> > > > > I'd like to discuss memory persistence across kexec.
> > > > >
> >
> > Hi Mike,
> >
> > I'm very interested in this topic and can contribute both presenting
> > and implementing changes upstream. We're planning on using KHO in our
> > kernel at Google but there are some limitations for our use case that
> > I believe can be addressed.
> >
> > Limitations:
> >
> > 1. Serialization callbacks are called by KHO during the activation
> > phase in series. In most cases different device drivers are
> > independent, the serialization can be parallelized.
> >
> > 2. Once the serialization callbacks are done, the device tree data
> > cannot be altered and drivers cannot add more data into the device
> > tree (except limited modification where drivers can remember the exact
> > node that was created and modify some properties, but that is too
> > limited).
> > This is bad because we have use cases where we need to save buffered
> > data (not memory locations) into the device tree at some late stage
> > before jumping to the new kernel.
>
> The device tree data cannot be altered because at kexec load time it is
> appended to kexec image and that image cannot be altered without a new
> kexec load.

Right, this is how it is implemented now.

One way to solve that is to pre-reserve space for the KHO tree, ideally
a reasonable amount, perhaps 32-64 MB, and allocate it at kexec load
time. During shutdown, we would use this pre-allocated space to convert
the KHO sparse tree to FDT format. Performing kexec load during the
blackout period violates the hypervisor's live update time
requirements, and it also prevents breaking serialization into phases,
i.e. pre-blackout, during blackout, during shutdown, etc. Furthermore,
for performance reasons serialization must be parallelizable for live
updates, which the FDT format does not support. Since we can already
specify the KHO scratch space, which is the maximum amount of memory
needed for the next kernel, we can similarly specify the maximum KHO
tree size.
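
To make the pre-reservation idea a bit more concrete, here is a rough
userspace sketch, not kernel code and not a proposed API. It assumes a
region whose maximum size is fixed once (standing in for "at kexec load
time") and from which independent writers claim disjoint slices with a
compare-and-swap, so they can fill them in parallel. All names
(kho_region, kho_region_claim, kho_region_save) are hypothetical, and
the final conversion to FDT is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: a region reserved once, standing in for kexec load time. */
struct kho_region {
	uint8_t *base;		/* start of the pre-reserved memory */
	size_t size;		/* maximum tree size, fixed at reservation */
	atomic_size_t used;	/* bytes handed out so far */
};

static void kho_region_init(struct kho_region *r, void *mem, size_t size)
{
	assert(r && mem && size);
	r->base = mem;
	r->size = size;
	atomic_init(&r->used, 0);
}

/*
 * Claim len bytes. Safe to call concurrently: each caller gets a
 * disjoint slice and can fill it without further locking.
 * Returns NULL once the reservation is exhausted.
 */
static void *kho_region_claim(struct kho_region *r, size_t len)
{
	size_t old = atomic_load(&r->used);

	do {
		/* used never exceeds size, so this cannot underflow */
		if (len > r->size - old)
			return NULL;
	} while (!atomic_compare_exchange_weak(&r->used, &old, old + len));

	return r->base + old;
}

/* Copy a late-arriving blob (e.g. buffered data) into the region. */
static int kho_region_save(struct kho_region *r, const void *data, size_t len)
{
	void *dst = kho_region_claim(r, len);

	if (!dst)
		return -1;
	memcpy(dst, data, len);
	return 0;
}
```

The point of the sketch is only that a fixed upper bound lets data be
added late, and by several writers at once, without touching the loaded
kexec image; a save that does not fit fails instead of growing the
region.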

> > 3. KHO requires devices to be serialized before
> > kexec_file_load()/kexec_load(), which means that load becomes part of
> > the VM blackout window, if KHO is used for hypervisor live update
> > scenarios this is a very bad limitation.
>
> KHO data has to be a part of kexec image and the way kexec works now there
> is no way to add anything to kexec image after kexec load.
> To be able to serialize the state closer to kexec reboot we'd need to
> change the way kexec images are created, regardless of what data format
> we use to pass the data between kernels.
>
> > 4. KHO activation should not really be needed, there should be two
> > phases: old KHO tree passed from the old kernel, and once it is fully
> > consumed, new KHO tree that can be updated at any time by devices that
> > is going to be passed to the next kernel during next reboot (kexec or
> > firmware that is aware of KHO...), instead of activation there should
> > be a user driver phase shift from old tree to new tree, once that is
> > done drivers can start serialize at will.
>
> If I understand you correctly, it's up driver to decide when to update the
> data that should be passed to the new kernel?

That is correct. I am planning to propose a driver callback,
dev->{driver,bus}->liveupdate(dev, liveupdate_phase), where drivers can
preserve state into KHO during different phases of the live update
cycle: before blackout, during blackout, and during shutdown. Once
implemented, this callback will be called instead of the shutdown()
callback when we perform a liveupdate reboot.
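
As a rough illustration only (the real proposal is still to come), the
dispatch could look something like the userspace sketch below. The
phase names, the lu_device/lu_driver stand-in types, and
lu_reboot_notify() are all hypothetical. The bus-level callback is
omitted for brevity, and falling back to shutdown() when a driver has
no liveupdate() callback is my assumption for the sketch, not something
that has been decided.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical phases, following the list in this mail. */
enum liveupdate_phase {
	LIVEUPDATE_PRE_BLACKOUT,
	LIVEUPDATE_BLACKOUT,
	LIVEUPDATE_SHUTDOWN,
};

struct lu_device;

/* Stand-in for the driver ops; only the two relevant callbacks. */
struct lu_driver {
	int (*liveupdate)(struct lu_device *dev, enum liveupdate_phase phase);
	void (*shutdown)(struct lu_device *dev);
};

struct lu_device {
	const struct lu_driver *driver;
	unsigned int phases_seen;	/* bitmask of phases handled */
	int shut_down;			/* set by the shutdown() path */
};

/*
 * On a liveupdate reboot call ->liveupdate() for the given phase
 * instead of ->shutdown(). Assumption: drivers without ->liveupdate()
 * fall back to the regular ->shutdown() path.
 */
static int lu_reboot_notify(struct lu_device *dev, int is_liveupdate,
			    enum liveupdate_phase phase)
{
	assert(dev && dev->driver);

	if (is_liveupdate && dev->driver->liveupdate)
		return dev->driver->liveupdate(dev, phase);
	if (dev->driver->shutdown)
		dev->driver->shutdown(dev);
	return 0;
}

/* Example driver: records which phases it was asked to preserve in. */
static int demo_liveupdate(struct lu_device *dev, enum liveupdate_phase phase)
{
	dev->phases_seen |= 1u << phase;
	return 0;
}

static void demo_shutdown(struct lu_device *dev)
{
	dev->shut_down = 1;
}
```

A driver would do its actual preservation work inside demo_liveupdate(),
choosing per phase what is cheap enough to do during blackout and what
can be done ahead of it.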

> Again, for now it's kexec limitation that kexec image cannot be altered
> between load and exec.
> Still, it's not clear to me how drivers could decide when they need to do
> the updates.

I will send the API proposal to the mailing list in a couple of weeks,
and we can also discuss it at one of David's meetings.

Pasha




