Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)

Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> · Sat, 8 Feb 2025 19:13:40 -0500

On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> Hi Mike,
>
> On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> >
> > From: "Mike Rapoport (Microsoft)" <rppt@xxxxxxxxxx>
> >
> > Hi,
> >
> > This is the next version of Alex's "kexec: Allow preservation of ftrace
> > buffers" series
> > (https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx).
> > To keep things simple, we decided to preserve "reserve_mem" regions
> > instead of ftrace buffers.
> >
> > The patches are also available in git:
> > https://git.kernel.org/rppt/h/kho/v4
> >
> >
> > Kexec today considers itself purely a boot loader: When we enter the new
> > kernel, any state the previous kernel left behind is irrelevant and the
> > new kernel reinitializes the system.
> >
> > However, there are use cases where this mode of operation is not what we
> > actually want. In virtualization hosts for example, we want to use kexec
> > to update the host kernel while virtual machine memory stays untouched.
> > When we add device assignment to the mix, we also need to ensure that
> > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> > need to do the same for the PCI subsystem. If we want to kexec while an
> > SEV-SNP enabled virtual machine is running, we need to preserve the VM
> > context pages and physical memory. See the Linux Plumbers Conference
> > 2023 presentation "pkernfs: Persisting guest memory and kernel/device
> > state safely across kexec" for details:
> >
> >   https://lpc.events/event/17/contributions/1485/
> >
> > To start us on the journey to support all the use cases above, this patch
> > set implements basic infrastructure to allow handover of kernel state
> > across kexec (Kexec HandOver, aka KHO). As a really simple example target,
> > we use memblock's reserve_mem.
> > With this patch set applied, memory that was reserved using "reserve_mem"
> > command line options remains intact after kexec and it is guaranteed to
> > reside at the same physical address.
>
> Nice work!
>
> One concern is that using memblock to reserve memory, as crashkernel=
> does, is not flexible. I worked on kdump years ago, and one of its biggest
> pains is deciding how much memory to reserve with crashkernel=. It is
> still a pain today.
>
> If we reserve more, that would mean more waste for the 1st kernel. If we
> reserve less, that would induce more OOM for the 2nd kernel.
>
> I'd suggest considering CMA, where the "reserved" memory can still be
> reused for other purposes, with pages migrated out of the reserved region
> on demand, that is, when loading a kexec kernel. Of course, we need to make
> sure the region is not reused by what you want to preserve here, e.g.,
> IOMMU state. So you might need additional work to make it work, but I
> still believe this is the right direction.

This is exactly what scratch memory is used for. Unlike crashkernel=,
the entire scratch area is available to user applications as CMA,
because we know that no kernel-reserved memory will come from that
area. This doesn't work for crashkernel=, because in some cases the
user pages might also need to be preserved in the crash dump. However,
if user pages are going to be discarded from the crash dump (as is done
99% of the time), it would be better for crashkernel= to use CMA or
ZONE_MOVABLE as well, so that only the memory actually occupied by the
crash kernel is consumed and none is wasted. We have an internal patch
at Google that does this, and I think it would be a good improvement
for the upstream kernel to carry as well.

Pasha

>
> Just my two cents.
>
> Thanks!