On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
>> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>>> Once deposited, these pages can't be accessed by Linux anymore and thus
>>> must be preserved in "used" state across kexec, as hypervisor state is
>>> unaware of kexec.
>>
>> If Linux can't access them, they're not RAM any more. I'd much rather
>> remove them from the memory map and move on with life rather than
>> implement a bunch of new ABI that's got to be handed across kernels.
>
> Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> and passing it over kexec looks like a natural extension.
> Also, adding more state to it also doesn't look like a new ABI.
> Or does it?

FDT makes it easier to pass arbitrary data around, but you're still
creating a new "default_pmpool" device tree node on one end and
consuming it on the other. That's a new ABI in my book.

> Let me also comment on removing these regions from the memory map. The
> major peculiarity here is that the hypervisor distinguishes between the
> pages deposited for guests to run and the pages deposited for the Linux
> root partition to keep the guest-related portion of hypervisor state in
> the root partition. And the latter is the matter in question.
>
> We can indeed isolate and deposit an excessive amount of memory upfront
> in hope that the hypervisor will never get into a situation where it
> needs more memory.
> However, it's not reliable, as the amount of memory will always be an
> estimation, depending on the number of expected guests, guest-attached
> devices, etc. And this becomes an even bigger problem when most of the
> memory is already removed from the memory map to host guest partitions.
> It's also not efficient, as the amount of memory required by the
> hypervisor can grow or shrink depending on the use case or host
> configuration, and depositing an excessive amount of memory will be a
> waste.
>
> But, actually, the idea of removing the pages from memory map was
> reflected to some extent in the first version of this proposal,
> so let me elaborate on it a bit.
>
> Effectively, instead of reserving and depositing a lot of memory to
> hypervisor upfront, the memory can be allocated from kernel memory when
> needed and then returned back when unused.
> This would still require pages removal from the memory map upon kexec,
> but that's another problem.

Let's distill this down a bit.

I agree that it's a waste to reserve an obscene amount of memory up
front for all guests for rare cases. Having the amount of consumed
memory grow is a nice feature.

You can also quite easily *shrink* the amount of memory on a given
kernel without new code. Right?

The problem comes when you've grown the footprint of hypervisor-donated
memory, kexec, and *THEN* want to shrink it. That's what needs new
metadata to be communicated over to the new kernel.

 1. Boot some kernel
 2. Grow the deposited memory a bunch
 3. Kexec
 4. Shrink the deposited memory

Right?

That's where you lose me.

Can't the deposited memory just be shrunk before kexec? Surely there
aren't a bunch of pathological things consuming that memory right before
kexec, which is basically a reboot.
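
To make the sequence above concrete, here is a toy userspace model in
Python. It is not kernel or hypervisor code, and every name in it
(Kernel, deposit, withdraw_all, kexec, stale_pages) is hypothetical. It
also assumes the hypervisor is able to give all deposited pages back
before kexec, which is exactly the open question in this thread. It only
shows why growing and then kexec'ing needs handover metadata, while
shrinking first does not.

```python
from dataclasses import dataclass, field


@dataclass
class Kernel:
    """Toy model of one booted kernel's view of memory (hypothetical)."""
    free_pages: set
    deposited: set = field(default_factory=set)

    def deposit(self, count):
        # Grow: hand `count` free pages over to the modelled hypervisor.
        for _ in range(count):
            self.deposited.add(self.free_pages.pop())

    def withdraw_all(self):
        # Shrink: take every deposited page back into the free pool.
        self.free_pages |= self.deposited
        self.deposited.clear()


def kexec(all_pages, handover=frozenset()):
    # The new kernel only knows about deposited pages that were
    # explicitly handed over (the "new ABI" case); everything else
    # looks like ordinary free RAM to it.
    known = set(handover)
    return Kernel(free_pages=set(all_pages) - known, deposited=known)


def stale_pages(old, new):
    # Pages the hypervisor still owns but the new kernel believes are free.
    return old.deposited & new.free_pages


if __name__ == "__main__":
    pages = set(range(8))

    # Grow, then kexec with no metadata: the new kernel is unaware.
    old = Kernel(free_pages=set(pages))
    old.deposit(3)
    print(len(stale_pages(old, kexec(pages))))

    # Grow, then kexec with the deposited set handed over.
    print(len(stale_pages(old, kexec(pages, handover=old.deposited))))

    # Grow, shrink before kexec, then kexec with no metadata.
    old.withdraw_all()
    print(len(stale_pages(old, kexec(pages))))
```

In this model only the first case leaves pages that the new kernel
would wrongly treat as usable. The second case avoids that by passing
state across kexec, and the third avoids it with no handover at all.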