On 12/03/24 at 06:06pm, Yan Zhao wrote:
> On Mon, Dec 02, 2024 at 10:17:16PM +0800, Baoquan He wrote:
> > On 11/29/24 at 01:52pm, Yan Zhao wrote:
> > > On Thu, Nov 28, 2024 at 11:19:20PM +0800, Baoquan He wrote:
> > > > On 11/27/24 at 06:01pm, Yan Zhao wrote:
> > > > > On Tue, Nov 26, 2024 at 07:38:05PM +0800, Baoquan He wrote:
> > > > > > On 10/24/24 at 08:15am, Yan Zhao wrote:
> > > > > > > On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> > > > > > > > "Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> writes:
> > > > > > > >
> > > > > > > > > Waiting minutes to get VM booted to shell is not feasible for most
> > > > > > > > > deployments. Lazy is sane default to me.
> > > > > > > >
> > > > > > > > Huh?
> > > > > > > >
> > > > > > > > Unless my guesses about what is happening are wrong, lazy is hiding
> > > > > > > > a serious implementation deficiency. From all hardware I have seen,
> > > > > > > > taking minutes is absolutely ridiculous.
> > > > > > > >
> > > > > > > > Does writing to all of memory at full speed take minutes? How can such
> > > > > > > > a system be functional?
> > > > > > > >
> > > > > > > > If you don't actually have to write to the pages and it is just some
> > > > > > > > accounting function it is even more ridiculous.
> > > > > > > >
> > > > > > > > I had previously thought that accept_memory was the firmware call.
> > > > > > > > Now that I see that it is just a wrapper for some hardware specific
> > > > > > > > calls I am even more perplexed.
> > > > > > > >
> > > > > > > > Quite honestly what this looks like to me is that someone failed to
> > > > > > > > enable write-combining or write-back caching when writing to memory
> > > > > > > > when initializing the protected memory. With the result that everything
> > > > > > > > is moving dog slow, and people are introducing complexity left and right
> > > > > > > > to avoid that bad implementation.
> > > > > > > >
> > > > > > > > Can someone please explain to me why this accept_memory stuff has to be
> > > > > > > > slow, why it has to take minutes to do its job.
> > > > > > > This kexec patch is a fix for a guest (TD)'s kexec failure.
> > > > > > >
> > > > > > > For a linux guest, the accept_memory() happens before the guest accesses a page.
> > > > > > > It will (if the guest is a TD)
> > > > > > > (1) trigger the host to allocate the physical page on the host to map the accessed
> > > > > > >     guest page, which might be slow with wait and sleep involved, depending on
> > > > > > >     the memory pressure on the host;
> > > > > > > (2) initialize the protected page.
> > > > > > >
> > > > > > > Actually most of guest memory is not accessed by the guest during the guest life
> > > > > > > cycle. accept_memory() may cause the host to commit a never-to-be-used page,
> > > > > > > with the host physical page not even being able to get swapped out.
> > > > > >
> > > > > > So this sounds to me more like a business requirement on cloud platforms,
> > > > > > e.g. if one customer books a guest instance with 60G memory, while the
> > > > > > customer actually always uses only 20G memory at most. Then the 40G memory
> > > > > > can be saved to reduce pressure on the host.
> > > > > Yes.
> > > >
> > > > That's very interesting, thanks for confirming.
> > > >
> > > > > >
> > > > > > I could be shallow, just a wild guess.
> > > > > > If my guess is right, at least those cloud service providers must like this
> > > > > > accept_memory feature very much.
> > > > > > >
> > > > > > > That's why we need a lazy accept, which does not accept_memory() until after a
> > > > > > > page is allocated by the kernel (in alloc_page(s)).
> > > > > >
> > > > > > By the way, I have two questions, maybe very shallow.
> > > > > >
> > > > > > 1) why can't we only search the already accepted memory to put the kexec
> > > > > > kernel/initrd/bootparam/purgatory?
> > > > > >
> > > > > Currently, the first kernel only accepts memory during the memory allocation in
> > > > > a lazy accept mode. Besides reducing boot time, it's also good for memory
> > > > > over-commitment as you mentioned above.
> > > > >
> > > > > My understanding of why the memory for the kernel/initrd/bootparam/purgatory is
> > > > > not allocated from the first kernel is that this memory usually needs to be
> > > > > physically contiguous. Since this memory will not be used by the first kernel,
> > > > > looking up from free RAM has a lower chance of failure compared to allocating it
> > > >
> > > > Well, there could be a misunderstanding here. The final loaded position of
> > > > kernel/initrd/bootparam/purgatory is not searched from free RAM, it's
> > > Oh, by free RAM, I mean system RAM that is marked as
> > > IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY, but not marked as
> > > IORESOURCE_SYSRAM_DRIVER_MANAGED.
> > >
> > > > just from RAM on x86. That means it may already have been allocated and be in
> > > > use by other components of the 1st kernel. Unlike kdump, the 2nd kernel of
> > > Yes, it's entirely possible that the destination address being searched out has
> > > already been allocated and is in use by the 1st kernel. e.g. for
> > > KEXEC_TYPE_DEFAULT, the source page for each segment is allocated from the 1st
> > > kernel, and it is allowed to have the same address as its corresponding
> > > destination address.
> > >
> > > However, it's not guaranteed that the destination address must have been
> > > allocated by the 1st kernel.
> > >
> > > > kexec reboot doesn't care about the 1st kernel's memory usage. We will copy
> > > > them from the intermediate position to the designated location when jumping.
> > > Right. If it's not guaranteed that the destination address has been accepted
> > > before this copying, the copying could trigger an error due to accessing an
> > > unaccepted page, which could be fatal for a linux TDX guest.
> >
> > Oh, I just said the opposite. I meant we could search according to the
> > current unaccepted->bitmap to make sure the destination area has definitely
> > been accepted. That would be the best if doable, while I know it's not
> > easy.
> Well, this sounds like introducing a new constraint in addition to the current
> checking of !kimage_is_destination_range() in locate_mem_hole_top_down() or
> locate_mem_hole_bottom_up(). (powerpc also has a different implementation).
>
> This could make the success unpredictable, depending on how many pages have
> been accepted by the 1st kernel and the layout of the accepted pages (e.g.,
> whether they are physically contiguous). The 1st kernel would also have no
> reliable way to ensure success except by accepting all the guest pages.

Yeah, when I finished reading the accept_memory code, this was the first idea
that came to my mind. If it could be made to work, it would be the ideal
solution. When I tried to make a draft change, though, it introduced a lot of
code changes and too much complication, so I just gave up. Maybe this can be
added to the cover letter too, to record the possible path we explored.
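
To record it a bit more concretely: from my reading, the lazy-accept code
tracks the not-yet-accepted ranges in the EFI unaccepted memory table, where
each bit in unaccepted->bitmap covers unit_size bytes starting at phys_base
and a set bit means "not accepted yet". A check of whether a candidate range
is fully accepted would then look roughly like below. This is only a
hand-written sketch based on my reading of
drivers/firmware/efi/unaccepted_memory.c, not copied from any tree; the
in-tree helper (range_contains_unaccepted_memory()) differs in detail and its
prototype has also changed between kernel versions:

#include <linux/efi.h>
#include <linux/bitops.h>
#include <linux/minmax.h>

/*
 * Sketch only: return true if [start, end) is fully accepted according
 * to the EFI unaccepted memory bitmap (a set bit means still unaccepted).
 * Locking against concurrent acceptors is ignored here; once a unit is
 * accepted it stays accepted, so a stale "unaccepted" answer can only
 * make us skip a usable range, never pick a bad one.
 */
static bool range_fully_accepted(struct efi_unaccepted_memory *unaccepted,
				 phys_addr_t start, phys_addr_t end)
{
	phys_addr_t base = unaccepted->phys_base;
	unsigned long unit_size = unaccepted->unit_size;
	phys_addr_t p;

	/* Memory outside the table range is not tracked as unaccepted. */
	if (end <= base || start >= base + unaccepted->size)
		return true;

	start = max(start, base);
	end = min(end, (phys_addr_t)(base + unaccepted->size));

	for (p = start; p < end; p += unit_size)
		if (test_bit((p - base) / unit_size, unaccepted->bitmap))
			return false;

	return true;
}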
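
And the draft change I gave up on went roughly in the below direction, i.e.
teaching the hole search in kernel/kexec_file.c to reject candidate
destination ranges that are not fully accepted yet. Again, just a sketch
written from memory to show the shape of the idea, not the actual draft; only
the top-down walk is shown, and how the unaccepted table is looked up is left
out (the 'unaccepted_table' handle below is hypothetical):

#include <linux/kexec.h>

/* Hypothetical handle to the mapped EFI unaccepted memory table; how it
 * is obtained is left out of this sketch. */
extern struct efi_unaccepted_memory *unaccepted_table;

/*
 * Sketch only: locate_mem_hole_top_down() with an extra constraint that
 * the candidate range must already be accepted, in addition to the
 * existing kimage_is_destination_range() check.  range_fully_accepted()
 * is the sketch helper above; the in-tree equivalent would be a negated
 * range_contains_unaccepted_memory().
 */
static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
				    struct kexec_buf *kbuf)
{
	unsigned long temp_start, temp_end;

	temp_end = min(end, kbuf->buf_max);
	temp_start = temp_end - kbuf->memsz + 1;

	do {
		temp_start = ALIGN_DOWN(temp_start, kbuf->buf_align);

		if (temp_start < start || temp_start < kbuf->buf_min)
			return 0;

		temp_end = temp_start + kbuf->memsz - 1;

		/* Skip ranges conflicting with already placed segments ... */
		if (kimage_is_destination_range(kbuf->image, temp_start, temp_end) ||
		    /* ... and, new here, ranges the 1st kernel never accepted. */
		    !range_fully_accepted(unaccepted_table, temp_start, temp_end)) {
			temp_start = temp_start - PAGE_SIZE;
			continue;
		}

		/* Found a suitable, already accepted hole. */
		break;
	} while (1);

	kbuf->mem = temp_start;
	return 1;
}

The unpredictability you pointed out is exactly where it got complicated: when
no fully accepted hole is large enough, some fallback is still needed (either
accepting the chosen range in the 1st kernel or failing the load), and the
bottom-up walk plus the arch-specific paths like powerpc would all need the
same treatment. That is why I stopped pursuing it.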