On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@xxxxxxxxxx> wrote:
> >
> > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> > >
> > >
> > > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@xxxxxxxxxx> wrote:
> > > >
> > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > > >> Why don't we do this unconditionally? At the very best we gain half a
> > > >> megabyte of memory (except the trampoline, which has to live there,
> > > >> but it is only a few kilobytes.)
> > > >
> > > > This is a great suggestion, thanks. I think we can fix it this way to
> > > > make the code simpler. Then the specific handling of real mode in
> > > > efi_free_boot_services() can be removed too.
> > > >
> > >
> > > This whole situation makes me think that the code is buggy before and
> > > buggy after.
> > >
> > > The issue here (I think) is that various pieces of code want to reserve
> > > specific pieces of otherwise-available low memory for their own
> > > nefarious uses. I don't know *why* crash kernel needs this, but that
> > > doesn't matter too much.
> >
> > The kdump kernel also needs to go through the real mode code path during
> > bootup. It is no different from a normal kernel except that it skips the
> > firmware resetting. So the kdump kernel needs the low 1M as system RAM,
> > just as a normal kernel does. Here we reserve the whole low 1M with
> > memblock_reserve() so that no later kernel or driver data resides in
> > this area; otherwise we would need to dump the content of this area to
> > vmcore. As we know, when a crash happens, the old memory of the 1st
> > kernel should stay untouched until vmcore dumping has read out its
> > content, while the kdump kernel needs to reuse the low 1M. In the past,
> > we used a backup region to copy out the low 1M area, and mapped the
> > backup region into the low 1M area in the vmcore elf file. In commit
> > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
> > crashkernel option is specified"), we changed to locking down the whole
> > low 1M to avoid writing any kernel data into it; this way we can skip
> > this area when dumping vmcore.
> >
> > The above is why we try to memblock_reserve() the whole low 1M. We don't
> > want to use it, we just don't want anyone to use it in the 1st kernel.
> >
> > >
> > > I propose that the right solution is to give low-memory-reserving code
> > > paths two chances to do what they need: once at the very beginning and
> > > once after EFI boot services are freed.
> > >
> > > Alternatively, just reserve *all* otherwise unused sub 1M memory up
> > > front, then release it right after releasing boot services, and then
> > > invoke the special cases exactly once.
> >
> > I am not sure I understand both suggested ways clearly. They look a
> > little complicated in our case. As I explained above, we want the whole
> > low 1M locked up, not one piece or some pieces of it.
>
> My second suggestion is probably the better one. Here it is, concretely:
>
> The early (pre-free_efi_boot_services) code just reserves all
> available sub-1M memory unconditionally, but it specially marks it as
> reserved-but-available-later. We stop allocating the trampoline page
> at this stage.
>
> In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> stick it in the pile of reserved memory created in the early step.
> This may involve splitting a block, kind of like the current
> trampoline late allocation works.
>
> Then, *after* free_efi_boot_services(), we run a single block of code
> that lets everything that wants sub-1M memory claim some. This means
> that the trampoline gets allocated and, if crashkernel wants to claim
> everything else, it can. After that, everything still unclaimed gets
> freed.
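If I understand your suggestion correctly, the flow would look roughly
like the sketch below. This is only meant to check my understanding:
none of reserve_all_low_1M(), low_1M_claim_and_release(),
claim_trampoline_pages(), crashkernel_wants_low_1M() or
free_unclaimed_low_1M() exist today, they are made-up names for
illustration.

/*
 * Early in setup_arch(), before EFI boot services are freed: take
 * everything below 1M out of the allocator, but remember that it can
 * be handed out again later. No trampoline allocation at this stage.
 */
static void __init reserve_all_low_1M(void)
{
	memblock_reserve(0, SZ_1M);
}

/*
 * Run exactly once, right after efi_free_boot_services() has put the
 * released sub-1M boot services memory back into the reserved pile.
 */
static void __init low_1M_claim_and_release(void)
{
	claim_trampoline_pages();        /* trampoline gets its few KB */

	if (crashkernel_wants_low_1M())  /* crashkernel option specified */
		return;                  /* crashkernel keeps everything else */

	free_unclaimed_low_1M();         /* release whatever is still unclaimed */
}

For reference, the current code flow is: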
void __init setup_arch(char **cmdline_p)
{
	...
	efi_reserve_boot_services();

	e820__memblock_alloc_reserved_mpc_new();

#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
	setup_bios_corruption_check();
#endif

	reserve_real_mode();

	trim_platform_memory_ranges();
	trim_low_memory_range();
	...
}

After efi_reserve_boot_services(), several functions are called that
need to reserve memory under the low 1M.

asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
{
	...
	setup_arch(&command_line);
	...
	mm_init();
	  --> mem_init();
	    --> memblock_free_all();
	...
#ifdef CONFIG_X86
	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_enter_virtual_mode();
		  --> efi_free_boot_services();
		    --> memblock_free_late();
#endif
	...
}

So from the code flow we can see that the buddy allocator is built in
mm_init(), which puts all memory from memblock.memory, excluding
memblock.reserved, into the buddy allocator. Much later,
efi_free_boot_services() is called to release the reserved EFI boot
services memory into the buddy allocator too.

Are you suggesting we should do the low 1M memory reservations after
efi_free_boot_services(), i.e. request the memory pages from the buddy
allocator for them? Please point out my misunderstanding if there is
any.

With my understanding, in the non-EFI case we have already done the
memory reservations with memblock_reserve(), e.g. what
e820__memblock_alloc_reserved_mpc_new() and reserve_real_mode() do; it
is just efi_reserve|free_boot_services() that breaks them when EFI is
enabled. We can do them again in efi_free_boot_services(), just like
the real mode reservation does.

Thanks
Baoquan
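PS: What I mean by doing them again in efi_free_boot_services() is
roughly the below. It is untested and only meant to show the idea;
crash_reserve_low_1M() is a made-up helper name:

void __init efi_free_boot_services(void)
{
	...
	/*
	 * After the boot services regions have been released, redo the
	 * low 1M reservation that efi_reserve_boot_services() bypassed,
	 * similar to how the real mode trampoline region is re-reserved
	 * today.
	 */
	crash_reserve_low_1M();
	...
}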