Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)

On Sat, Feb 8, 2025 at 4:14 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > Hi Mike,
> >
> > On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > >
> > > From: "Mike Rapoport (Microsoft)" <rppt@xxxxxxxxxx>
> > >
> > > Hi,
> > >
> > > This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx).
> > > To keep things simple, we decided to preserve "reserve_mem" regions
> > > instead of ftrace buffers.
> > >
> > > The patches are also available in git:
> > > https://git.kernel.org/rppt/h/kho/v4
> > >
> > >
> > > Kexec today considers itself purely a boot loader: When we enter the new
> > > kernel, any state the previous kernel left behind is irrelevant and the
> > > new kernel reinitializes the system.
> > >
> > > However, there are use cases where this mode of operation is not what we
> > > actually want. In virtualization hosts for example, we want to use kexec
> > > to update the host kernel while virtual machine memory stays untouched.
> > > When we add device assignment to the mix, we also need to ensure that
> > > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> > > need to do the same for the PCI subsystem. If we want to kexec while an
> > > SEV-SNP enabled virtual machine is running, we need to preserve the VM
> > > context pages and physical memory. See "pkernfs: Persisting guest memory
> > > and kernel/device state safely across kexec" Linux Plumbers
> > > Conference 2023 presentation for details:
> > >
> > >   https://lpc.events/event/17/contributions/1485/
> > >
> > > To start us on the journey to support all the use cases above, this patch
> > > implements basic infrastructure to allow hand over of kernel state across
> > > kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> > > memblock's reserve_mem.
> > > With this patch set applied, memory that was reserved using "reserve_mem"
> > > command line options remains intact after kexec and it is guaranteed to
> > > reside at the same physical address.
> >
> > Nice work!
> >
> > One concern is that reserving memory with memblock, as crashkernel= does,
> > is not flexible. I worked on kdump years ago, and one of its biggest pains
> > is deciding how much memory to reserve with crashkernel=. It is still a
> > pain today.
> >
> > If we reserve more, that would mean more waste for the 1st kernel. If we
> > reserve less, that would induce more OOM for the 2nd kernel.
> >
> > I'd suggest considering CMA, where the "reserved" memory remains usable
> > for other purposes and its pages are migrated out of the reserved region
> > on demand, that is, when loading a kexec kernel. Of course, we need to make
> > sure the region is not reused by what you want to preserve here, e.g.,
> > IOMMU. So it might take additional work, but I still believe this is the
> > right direction.
>
> This is exactly what scratch memory is used for. Unlike crashkernel=,
> the entire scratch area is available to user applications as CMA, as
> we know that no kernel-reserved memory will come from that area. This
> doesn't work for crashkernel=, because in some cases, the user pages
> might also need to be preserved in the crash dump. However, if user
> pages are going to be discarded from the crash dump (as is done 99% of
> the time), then it is better to make it use CMA or ZONE_MOVABLE as well,
> so that only the memory occupied by the crash kernel is set aside and no
> memory is wasted. We have an internal patch at Google that does this,
> and I think it would be a good improvement for the upstream kernel to
> carry as well.

Good to know CMA is already used; I could not tell from the cover letter.

User-space pages need to be preserved in scenarios like RDMA, which pins
user-space pages for DMA transfers. Since the goal here is also to preserve
hardware state such as RDMA's, I guess the same concern remains.

Thanks!




