On 12/03/24 at 06:06pm, Yan Zhao wrote:
> On Mon, Dec 02, 2024 at 10:17:16PM +0800, Baoquan He wrote:
> > On 11/29/24 at 01:52pm, Yan Zhao wrote:
> > > On Thu, Nov 28, 2024 at 11:19:20PM +0800, Baoquan He wrote:
> > > > On 11/27/24 at 06:01pm, Yan Zhao wrote:
> > > > > On Tue, Nov 26, 2024 at 07:38:05PM +0800, Baoquan He wrote:
> > > > > > On 10/24/24 at 08:15am, Yan Zhao wrote:
> > > > > > > On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> > > > > > > > "Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> writes:
> > > > > > > >
> > > > > > > > > Waiting minutes to get VM booted to shell is not feasible for most
> > > > > > > > > deployments. Lazy is sane default to me.
> > > > > > > >
> > > > > > > > Huh?
> > > > > > > >
> > > > > > > > Unless my guesses about what is happening are wrong, lazy is hiding
> > > > > > > > a serious implementation deficiency. From all hardware I have seen,
> > > > > > > > taking minutes is absolutely ridiculous.
> > > > > > > >
> > > > > > > > Does writing to all of memory at full speed take minutes? How can such
> > > > > > > > a system be functional?
> > > > > > > >
> > > > > > > > If you don't actually have to write to the pages and it is just some
> > > > > > > > accounting function it is even more ridiculous.
> > > > > > > >
> > > > > > > > I had previously thought that accept_memory was the firmware call.
> > > > > > > > Now that I see that it is just a wrapper for some hardware specific
> > > > > > > > calls I am even more perplexed.
> > > > > > > >
> > > > > > > > Quite honestly what this looks like to me is that someone failed to
> > > > > > > > enable write-combining or write-back caching when writing to memory
> > > > > > > > when initializing the protected memory. With the result that everything
> > > > > > > > is moving dog slow, and people are introducing complexity left and right
> > > > > > > > to avoid that bad implementation.
> > > > > > > >
> > > > > > > > Can someone please explain to me why this accept_memory stuff has to be
> > > > > > > > slow, why it has to take minutes to do its job.
> > > > > > > This kexec patch is a fix for a guest (TD)'s kexec failure.
> > > > > > >
> > > > > > > For a linux guest, the accept_memory() happens before the guest accesses a page.
> > > > > > > It will (if the guest is a TD)
> > > > > > > (1) trigger the host to allocate the physical page on the host to map the accessed
> > > > > > >     guest page, which might be slow with wait and sleep involved, depending on
> > > > > > >     the memory pressure on the host;
> > > > > > > (2) initialize the protected page.
> > > > > > >
> > > > > > > Actually most of guest memory is not accessed by the guest during the guest life
> > > > > > > cycle. accept_memory() may cause the host to commit a never-to-be-used page,
> > > > > > > with the host physical page not even being able to get swapped out.
> > > > > >
> > > > > > So this sounds to me more like a business requirement on cloud platforms,
> > > > > > e.g. if one customer books a guest instance with 60G memory, while the
> > > > > > customer actually always uses only 20G memory at most. Then the 40G memory
> > > > > > can be saved to reduce pressure on the host.
> > > > > Yes.
> > > >
> > > > That's very interesting, thanks for confirming.
> > > >
> > > > > >
> > > > > > I could be shallow, just a wild guess.
> > > > > > If my guess is right, at least those cloud service providers must like this
> > > > > > accept_memory feature very much.
> > > > > > >
> > > > > > > That's why we need a lazy accept, which does not accept_memory() until after a
> > > > > > > page is allocated by the kernel (in alloc_page(s)).
> > > > > >
> > > > > > By the way, I have two questions, maybe very shallow.
> > > > > >
> > > > > > 1) why can't we only search the already accepted memory to put the kexec
> > > > > > kernel/initrd/bootparam/purgatory?
> > > > > >
> > > > > Currently, the first kernel only accepts memory during the memory allocation in
> > > > > a lazy accept mode. Besides reducing boot time, it's also good for memory
> > > > > over-commitment as you mentioned above.
> > > > >
> > > > > My understanding of why the memory for the kernel/initrd/bootparam/purgatory is
> > > > > not allocated from the first kernel is that this memory usually needs to be
> > > > > physically contiguous. Since this memory will not be used by the first kernel,
> > > > > looking up from free RAM has a lower chance of failure compared to allocating it
> > > >
> > > > Well, there could be a misunderstanding here. The final loaded position of
> > > > kernel/initrd/bootparam/purgatory is not searched from free RAM, it's
> > > Oh, by free RAM, I mean system RAM that is marked as
> > > IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY, but not marked as
> > > IORESOURCE_SYSRAM_DRIVER_MANAGED.
> > >
> > > > just from RAM on x86. That means it may already have been allocated and be in
> > > > use by other components of the 1st kernel. Unlike kdump, the 2nd kernel of
> > > Yes, it's entirely possible that the destination address being searched out has
> > > already been allocated and is in use by the 1st kernel. e.g. for
> > > KEXEC_TYPE_DEFAULT, the source page for each segment is allocated from the 1st
> > > kernel, and it is allowed to have the same address as its corresponding
> > > destination address.
> > >
> > > However, it's not guaranteed that the destination address must have been
> > > allocated by the 1st kernel.
> > >
> > > > kexec reboot doesn't care about the 1st kernel's memory usage. We will copy
> > > > them from the intermediate position to the designated location when jumping.
> > > Right. If it's not guaranteed that the destination address has been accepted
> > > before this copying, the copying could trigger an error due to accessing an
> > > unaccepted page, which could be fatal for a linux TDX guest.
> >
> > Oh, I just said the opposite. I meant we could search according to the
> > current unaccepted->bitmap to make sure the destination area has definitely
> > been accepted. That would be the best if doable, while I know it's not
> > easy.
> Well, this sounds like introducing a new constraint in addition to the current
> checking of !kimage_is_destination_range() in locate_mem_hole_top_down() or
> locate_mem_hole_bottom_up(). (powerpc also has a different implementation).
>
> This could make the success unpredictable, depending on how many pages have
> been accepted by the 1st kernel and the layout of the accepted pages (e.g.,
> whether they are physically contiguous). The 1st kernel would also have no
> reliable way to ensure success except by accepting all the guest pages.

Yeah, when I finished reading the accept_memory code, this was the first idea
that came to my mind. If it could be made to work, it would be the ideal
solution. When I tried to make a draft change, though, it introduced a lot of
code changes and too much complication, so I just gave up. Maybe this can be
added to the cover letter too, to record the possible path we explored.
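
To record it a bit more concretely: from my reading, the lazy-accept code
tracks the not-yet-accepted ranges in the EFI unaccepted memory table, where
each bit in unaccepted->bitmap covers unit_size bytes starting at phys_base
and a set bit means "not accepted yet". A check of whether a candidate range
is fully accepted would then look roughly like below. This is only a
hand-written sketch based on my reading of
drivers/firmware/efi/unaccepted_memory.c, not copied from any tree; the
in-tree helper (range_contains_unaccepted_memory()) differs in detail and its
prototype has also changed between kernel versions:

#include <linux/efi.h>
#include <linux/bitops.h>
#include <linux/minmax.h>

/*
 * Sketch only: return true if [start, end) is fully accepted according
 * to the EFI unaccepted memory bitmap (a set bit means still unaccepted).
 * Locking against concurrent acceptors is ignored here; once a unit is
 * accepted it stays accepted, so a stale "unaccepted" answer can only
 * make us skip a usable range, never pick a bad one.
 */
static bool range_fully_accepted(struct efi_unaccepted_memory *unaccepted,
				 phys_addr_t start, phys_addr_t end)
{
	phys_addr_t base = unaccepted->phys_base;
	unsigned long unit_size = unaccepted->unit_size;
	phys_addr_t p;

	/* Memory outside the table range is not tracked as unaccepted. */
	if (end <= base || start >= base + unaccepted->size)
		return true;

	start = max(start, base);
	end = min(end, (phys_addr_t)(base + unaccepted->size));

	for (p = start; p < end; p += unit_size)
		if (test_bit((p - base) / unit_size, unaccepted->bitmap))
			return false;

	return true;
}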
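
And the draft change I gave up on went roughly in the below direction, i.e.
teaching the hole search in kernel/kexec_file.c to reject candidate
destination ranges that are not fully accepted yet. Again, just a sketch
written from memory to show the shape of the idea, not the actual draft; only
the top-down walk is shown, and how the unaccepted table is looked up is left
out (the 'unaccepted_table' handle below is hypothetical):

#include <linux/kexec.h>

/* Hypothetical handle to the mapped EFI unaccepted memory table; how it
 * is obtained is left out of this sketch. */
extern struct efi_unaccepted_memory *unaccepted_table;

/*
 * Sketch only: locate_mem_hole_top_down() with an extra constraint that
 * the candidate range must already be accepted, in addition to the
 * existing kimage_is_destination_range() check.  range_fully_accepted()
 * is the sketch helper above; the in-tree equivalent would be a negated
 * range_contains_unaccepted_memory().
 */
static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
				    struct kexec_buf *kbuf)
{
	unsigned long temp_start, temp_end;

	temp_end = min(end, kbuf->buf_max);
	temp_start = temp_end - kbuf->memsz + 1;

	do {
		temp_start = ALIGN_DOWN(temp_start, kbuf->buf_align);

		if (temp_start < start || temp_start < kbuf->buf_min)
			return 0;

		temp_end = temp_start + kbuf->memsz - 1;

		/* Skip ranges conflicting with already placed segments ... */
		if (kimage_is_destination_range(kbuf->image, temp_start, temp_end) ||
		    /* ... and, new here, ranges the 1st kernel never accepted. */
		    !range_fully_accepted(unaccepted_table, temp_start, temp_end)) {
			temp_start = temp_start - PAGE_SIZE;
			continue;
		}

		/* Found a suitable, already accepted hole. */
		break;
	} while (1);

	kbuf->mem = temp_start;
	return 1;
}

The unpredictability you pointed out is exactly where it got complicated: when
no fully accepted hole is large enough, some fallback is still needed (either
accepting the chosen range in the 1st kernel or failing the load), and the
bottom-up walk plus the arch-specific paths like powerpc would all need the
same treatment. That is why I stopped pursuing it.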