On Sat, Feb 8, 2025 at 4:14 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > Hi Mike,
> >
> > On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > >
> > > From: "Mike Rapoport (Microsoft)" <rppt@xxxxxxxxxx>
> > >
> > > Hi,
> > >
> > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx),
> > > just to make things simpler instead of ftrace we decided to preserve
> > > "reserve_mem" regions.
> > >
> > > The patches are also available in git:
> > > https://git.kernel.org/rppt/h/kho/v4
> > >
> > >
> > > Kexec today considers itself purely a boot loader: When we enter the new
> > > kernel, any state the previous kernel left behind is irrelevant and the
> > > new kernel reinitializes the system.
> > >
> > > However, there are use cases where this mode of operation is not what we
> > > actually want. In virtualization hosts for example, we want to use kexec
> > > to update the host kernel while virtual machine memory stays untouched.
> > > When we add device assignment to the mix, we also need to ensure that
> > > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> > > need to do the same for the PCI subsystem. If we want to kexec while an
> > > SEV-SNP enabled virtual machine is running, we need to preserve the VM
> > > context pages and physical memory. See "pkernfs: Persisting guest memory
> > > and kernel/device state safely across kexec" Linux Plumbers
> > > Conference 2023 presentation for details:
> > >
> > > https://lpc.events/event/17/contributions/1485/
> > >
> > > To start us on the journey to support all the use cases above, this patch
> > > implements basic infrastructure to allow hand over of kernel state across
> > > kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> > > memblock's reserve_mem.
> > > With this patch set applied, memory that was reserved using "reserve_mem"
> > > command line options remains intact after kexec and it is guaranteed to
> > > reside at the same physical address.
> >
> > Nice work!
> >
> > One concern there is that using memblock to reserve memory as crashkernel=
> > is not flexible. I worked on kdump years ago and one of the biggest pains
> > of kdump is how much memory should be reserved with crashkernel=. And
> > it is still a pain today.
> >
> > If we reserve more, that would mean more waste for the 1st kernel. If we
> > reserve less, that would induce more OOM for the 2nd kernel.
> >
> > I'd suggest considering using CMA, where the "reserved" memory can be
> > still reusable for other purposes, just that pages can be migrated out of this
> > reserved region on demand, that is, when loading a kexec kernel. Of course,
> > we need to make sure they are not reused by what you want to preserve here,
> > e.g., IOMMU. So you might need additional work to make it work, but still I
> > believe this is the right direction.
>
> This is exactly what scratch memory is used for. Unlike crashkernel=,
> the entire scratch area is available to user applications as CMA, as
> we know that no kernel-reserved memory will come from that area. This
> doesn't work for crashkernel=, because in some cases, the user pages
> might also need to be preserved in the crash dump. However, if user
> pages are going to be discarded from the crash dump (as is done 99% of
> the time), then it is better to also make it use CMA or ZONE_MOVABLE
> and use only the memory occupied by the crash kernel and do not waste
> any memory at all. We have an internal patch at Google that does this,
> and I think it would be a good improvement for the upstream kernel to
> carry as well.

Good to know CMA is already used, I could not tell from the cover letter.
The case where user-space pages need to be preserved is for scenarios
like RDMA, which pins user-space pages for DMA transfers. Since the goal
here is also to preserve hardware state like RDMA's, I guess the same
concern remains.

Thanks!