Hello Dave,

On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > > Fundamentally when deciding where to place a new kernel kexec (either
> > > user space or the in kernel kexec_file implementation) needs to be able
> > > to ask the question which memory areas are reserved.
[...]
> > > So my question is why doesn't the ESRT reservation wind up in
> > > /proc/iomem?
> >
> > My guess is that the focus was that some EFI structures need to be kept
> > around across the life cycle of *one* running kernel and
> > memblock_reserve() was enough for that. Marking them so they survive
> > kexecing another kernel might just never have cropped up thus far. Ard
> > or Matt would know.
> Can you check your un-reserved memory, if your memory falls into EFI
> BOOT* then in X86 you can use something like below if it is not covered:
>
> void __init efi_esrt_init(void)
> {
> ...
> 	pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> 	if (md.type == EFI_BOOT_SERVICES_DATA)
> 		efi_mem_reserve(esrt_data, esrt_data_size);
> ...
> }

Please bear with me if I'm a bit slow on the uptake here: On my machine,
the esrt module reports at boot:

[    0.001244] esrt: Reserving ESRT space from 0x0000000074dd2f98 to 0x0000000074dd2fd0.

This area is of type "Boot Data" (== BOOT_SERVICES_DATA), which makes the
code you quote reserve it using memblock_reserve(), as shown by
memblock=debug:

[    0.001246] memblock_reserve: [0x0000000074dd2f98-0x0000000074dd2fcf] efi_mem_reserve+0x1d/0x2b

It also calls into
arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve(), which tags it as
EFI_MEMORY_RUNTIME while the surrounding ones aren't, as shown by
efi=debug:

[    0.178111] efi: mem10: [Boot Data | | | | | | | | | |WB|WT|WC|UC] range=[0x0000000074dd3000-0x0000000075becfff] (14MB)
[    0.178113] efi: mem11: [Boot Data |RUN| | | | | | | | |WB|WT|WC|UC] range=[0x0000000074dd2000-0x0000000074dd2fff] (0MB)
[    0.178114] efi: mem12: [Boot Data | | | | | | | | | |WB|WT|WC|UC] range=[0x000000006d635000-0x0000000074dd1fff] (119MB)

This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
from calling __memblock_free_late() on it. And indeed, memblock=debug does
not report this area as being freed while the surrounding ones are:

[    0.178369] __memblock_free_late: [0x0000000074dd3000-0x0000000075becfff] efi_free_boot_services+0x126/0x1f8
[    0.178658] __memblock_free_late: [0x000000006d635000-0x0000000074dd1fff] efi_free_boot_services+0x126/0x1f8

The esrt area does not show up in /proc/iomem though:

00100000-763f5fff : System RAM
  62000000-62a00d80 : Kernel code
  62c00000-62f15fff : Kernel rodata
  63000000-630ea8bf : Kernel data
  63fed000-641fffff : Kernel bss
  65000000-6affffff : Crash kernel

And thus kexec loads the new kernel right over that area, as shown when
enabling -DDEBUG on kexec_file.c (0x74dd3000 being in between 0x73000000
and 0x73000000+0x24be000 = 0x754be000):

[  650.007695] kexec_file: Loading segment 0: buf=0x000000003a9c84d6 bufsz=0x5000 mem=0x98000 memsz=0x6000
[  650.007699] kexec_file: Loading segment 1: buf=0x0000000017b2b9e6 bufsz=0x1240 mem=0x96000 memsz=0x2000
[  650.007703] kexec_file: Loading segment 2: buf=0x00000000fdf72ba2 bufsz=0x1150888 mem=0x73000000 memsz=0x24be000
...

because it looks for any memory hole large enough in iomem resources
tagged as System RAM, from which 0x74dd2000-0x74dd2fff would then need to
be excluded on my system.

Looking some more at efi_arch_mem_reserve(), I see that it also registers
the area with efi.memmap and installs it using efi_memmap_install(), which
seems to call memremap(MEMREMAP_WB) on it. From my understanding of the
comments in the source of memremap(), MEMREMAP_WB does specifically *not*
reserve that memory in any way.

> Unfortunately I noticed there are different requirements/ways for
> different types of "reserved" memory. But that is another topic..

I tried to reserve the area with something like this:

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 4de244683a7e..b86a5df027a2 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	efi_memory_desc_t md;
 	int num_entries;
 	void *new;
+	struct resource *res;
 
 	if (efi_mem_desc_lookup(addr, &md) ||
 	    md.type != EFI_BOOT_SERVICES_DATA) {
@@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	early_memunmap(new, new_size);
 
 	efi_memmap_install(new_phys, num_entries);
+
+	res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);
+	if (!res) {
+		pr_err("Failed to allocate EFI io resource allocator for "
+		       "0x%llx:0x%llx", mr.range.start, mr.range.end);
+		return;
+	}
+
+	res->start = mr.range.start;
+	res->end = mr.range.end;
+	res->name = "EFI runtime";
+	res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	res->desc = IORES_DESC_NONE;
+
+	insert_resource(&iomem_resource, res);
 }
 
 /*

... but failed miserably: the kernel no longer boots, and I have no
experience whatsoever in programming and debugging early kernel init. But
I am somewhat keen to ride the learning curve here. :)

Am I on the right track or were you a couple of leaps ahead of me already
and I just didn't get the question?

-- 
Thanks,
Michael