Hi Eric,

This is a repost of the patch "kexec_core: Accept unaccepted kexec
destination addresses" [1], rebased to v6.13-rc2.

The code implementation remains unchanged, but the patch message now
includes more background and explanations to address previous concerns
from you and Baoquan.

Additionally, below is a more detailed explanation of unaccepted memory
in TDX. Please let me know if it is still not clear enough.

== Unaccepted memory in TDX ==

Intel TDX (Trust Domain Extensions) provides a hardware-based trusted
execution environment for TDs (hardware-isolated VMs). The host OS is
not trusted. Although it allocates physical pages for TDs, it does not
and cannot know the content of a TD's pages.

A TD's memory is added via two methods, which invoke different
instructions in the host:

1. For the TD's initial private memory, such as for firmware HOBs:
   - This type of memory is added without requiring the TD's acceptance.
   - The TD will perform attestation of the page GPA and content later.

2. For the TD's runtime private memory:
   - After the host adds memory, it remains pending until the TD
     accepts it.

Memory added by method 1 is not relevant to the unaccepted memory we
will discuss.

For memory added by method 2, the TD's acceptance can occur before or
after the TD's memory access:

(a) Access first:
    - TD accesses a private GPA,
    - Host OS allocates physical memory,
    - Host OS requests hardware to map the physical page to the GPA,
    - TD accepts the GPA.

(b) Accept first:
    - TD accepts a private GPA,
    - Host OS allocates physical memory,
    - Host OS requests hardware to map the physical page to the GPA,
    - TD accesses the GPA.

"(a) Access first" is regarded as unsafe for a Linux guest and is
therefore not chosen.

For "(b) Accept first", the TD's "accept" operation includes the
following steps:
- Trigger a VM-exit.
- The host OS allocates a physical page and requests hardware to map
  the physical page to the GPA.
- Initialize the physical page with content set to 0.
- Encrypt the memory.

To enable the "Accept first" approach, an "unaccepted memory" mechanism
is used, which requires cooperation from the virtual firmware and the
Linux guest.

1. The host OS adds initial private memory that does not require the
   TD's acceptance. The host OS composes EFI_HOB_RESOURCE_DESCRIPTORs
   and loads the virtual firmware first. Guest RAM, excluding that for
   initial memory, is reported as UNACCEPTED in the descriptor.

2. The virtual firmware parses the descriptors and accepts the
   UNACCEPTED memory below 4G. It then excludes the below-4G range from
   the UNACCEPTED range.

3. The virtual firmware loads the Linux guest image (the address to
   load is below 4G).

4. The Linux guest requests the UNACCEPTED bitmap from the virtual
   firmware:
   - Locate EFI_UNACCEPTED_MEMORY entries from the memory map returned
     by the efi_get_memory_map boot service.
   - Request via EFI boot service to allocate an unaccepted_table in
     memory of type EFI_ACPI_RECLAIM_MEMORY (E820_TYPE_ACPI) to hold
     the unaccepted bitmap.
   - Install the unaccepted_table as an EFI configuration table via the
     boot service.
   - Initialize the unaccepted bitmap according to the
     EFI_UNACCEPTED_MEMORY entries.

5. The Linux guest decompresses the kernel image. It accepts the target
   GPA for decompression first in case it is not accepted by the
   virtual firmware.

6. The Linux guest calls memblock_free_all() to put all memory into the
   freelists for the buddy allocator. memblock_free_all() further calls
   down to __free_pages_core() to handle memory in 4M (order 10) units.
   - In eager mode, the Linux guest accepts all memory and appends it
     to the freelists.
   - In lazy mode, the Linux guest checks if the entire 4M memory has
     been accepted by querying the unaccepted bitmap.
     a) If all memory is accepted, it adds the 4M memory to the
        freelists.
     b) If any memory is unaccepted (even if the range contains
        accepted pages), the Linux guest does not add the 4M memory to
        the freelists.
        Instead, it queues the first page in the 4M range onto the list
        zone->unaccepted_pages and sets the Unaccepted flag on that
        page.

7. When there is not enough free memory, cond_accept_memory() in the
   Linux guest calls try_to_accept_memory_one() to dequeue a page from
   the list zone->unaccepted_pages, clear its Unaccepted flag, accept
   the entire 4M memory range represented by the page, and add the 4M
   memory to the freelists.

== Conclusion ==

- The zone->unaccepted_pages list is a mechanism to conditionally make
  accepted private memory available to the page allocators.
- The unaccepted bitmap resides in the firmware's reserved memory and
  persists across guest OSs. It records exactly which pages have not
  been accepted.
- Memory ranges represented by zone->unaccepted_pages may contain
  accepted pages.

For kexec in TDs,
- If the segments' destination addresses are within the range managed
  by the buddy allocator, the pages must already be in an accepted
  state. Calling accept_memory() will check the unaccepted bitmap and
  do nothing.
- If the segments' destination addresses are not yet managed by the
  buddy allocator, the pages may or may not have been accepted.
  Calling accept_memory() will perform the "accept" operation if they
  are not accepted.

The kexec'ed second guest kernel obtains the unaccepted bitmap by
locating the unaccepted_table in the EFI configuration tables. So,
pages whose bits are clear in the unaccepted bitmap are not accepted
again.

The unaccepted table/bitmap is only useful for TDs. A Linux host will
detect that the physical firmware does not support the memory
acceptance protocol, and accept_memory() will simply bail out.

Thanks
Yan

[1] https://lore.kernel.org/all/20241021034553.18824-1-yan.y.zhao@xxxxxxxxx

Yan Zhao (1):
  kexec_core: Accept unaccepted kexec segments' destination addresses

 kernel/kexec_core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

-- 
2.43.2
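
For illustration only, the bitmap behavior summarized in the Conclusion
can be modeled in user space. This is a minimal sketch, not kernel
code: every model_* name is hypothetical, the one-bit-per-4K-page
granularity and the 64-unit address space are assumptions made for the
example, and the "accept" operation is reduced to a counter. It shows
that accepting a range only touches units whose bit is still set, so a
repeated call, or a call on a range that is already accepted, does
nothing.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_UNIT_SIZE 4096ULL /* assumption: one bitmap bit per 4K page */
#define MODEL_NR_UNITS  64ULL   /* assumption: tiny 256K address space */

static uint64_t unaccepted_bitmap; /* bit set => unit not yet accepted */
static unsigned int nr_accept_ops; /* counts simulated "accept" operations */

/* Mark [start, start + size) as unaccepted, as the firmware would report it. */
static void model_mark_unaccepted(uint64_t start, uint64_t size)
{
    uint64_t first = start / MODEL_UNIT_SIZE;
    uint64_t last = (start + size) / MODEL_UNIT_SIZE;

    for (uint64_t u = first; u < last && u < MODEL_NR_UNITS; u++)
        unaccepted_bitmap |= 1ULL << u;
}

/* Accept only the units whose bit is still set; accepted units are skipped. */
static void model_accept_memory(uint64_t start, uint64_t size)
{
    uint64_t first = start / MODEL_UNIT_SIZE;
    uint64_t last = (start + size + MODEL_UNIT_SIZE - 1) / MODEL_UNIT_SIZE;

    for (uint64_t u = first; u < last && u < MODEL_NR_UNITS; u++) {
        if (!(unaccepted_bitmap & (1ULL << u)))
            continue;                      /* already accepted: no-op */
        nr_accept_ops++;                   /* stands in for the real accept */
        unaccepted_bitmap &= ~(1ULL << u); /* record the unit as accepted */
    }
}

/* Walk through the kexec cases; returns the total number of accept ops. */
static unsigned int model_kexec_demo(void)
{
    unaccepted_bitmap = 0;
    nr_accept_ops = 0;

    /* Units 0..15 are accepted at boot; units 16..63 are unaccepted. */
    model_mark_unaccepted(16 * MODEL_UNIT_SIZE, 48 * MODEL_UNIT_SIZE);

    /* Destination inside accepted memory: the bitmap check finds nothing. */
    model_accept_memory(4 * MODEL_UNIT_SIZE, 8 * MODEL_UNIT_SIZE);
    assert(nr_accept_ops == 0);

    /* Destination inside unaccepted memory: four units get accepted. */
    model_accept_memory(20 * MODEL_UNIT_SIZE, 4 * MODEL_UNIT_SIZE);
    assert(nr_accept_ops == 4);

    /* Same range again, e.g. from a second kernel sharing the bitmap. */
    model_accept_memory(20 * MODEL_UNIT_SIZE, 4 * MODEL_UNIT_SIZE);
    assert(nr_accept_ops == 4);

    /* Range straddling accepted (14, 15) and unaccepted (16, 17) units. */
    model_accept_memory(14 * MODEL_UNIT_SIZE, 4 * MODEL_UNIT_SIZE);
    assert(nr_accept_ops == 6);

    return nr_accept_ops;
}
```

Calling model_kexec_demo() from a main() exercises all four cases. The
model keeps the bitmap as the single source of truth, which mirrors why
the text above says the second kernel does not accept pages again once
it has located the same bitmap.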