Changelog:

v2:
  - Add comment in memblock_free_late() (suggested by Mike Rapoport)
  - Improve commit message, including an explanation of the x86_64 EFI
    boot issue (suggested by Mike Rapoport and David Rientjes)

Hi all,

(I've CC'ed the KMSAN and x86 EFI maintainers as an FYI; the only code
change I'm proposing is in memblock.)

I've run into a case where pages are not released from memblock to the
buddy allocator. If deferred struct page init is enabled, and
memblock_free_late() is called before page_alloc_init_late() has run,
and the pages being freed are in the deferred init range, then the pages
are never released. memblock_free_late() calls memblock_free_pages()
which only releases the pages if they are not in the deferred range.
That is correct for free pages because they will be initialized and
released by page_alloc_init_late(), but memblock_free_late() is dealing
with reserved pages. If memblock_free_late() doesn't release those
pages, they will forever be reserved. All reserved pages were
initialized by memblock_free_all(), so I believe the fix is to simply
have memblock_free_late() call __free_pages_core() directly instead of
memblock_free_pages().

In addition, there was a recent change (3c20650982609 "init: kmsan: call
KMSAN initialization routines") that added a call to
kmsan_memblock_free_pages() in memblock_free_pages(). It looks to me
like it would also be incorrect to make that call in the
memblock_free_late() case, because the KMSAN metadata was already
initialized for all reserved pages by kmsan_init_shadow(), which runs
before memblock_free_all(). Having memblock_free_late() call
__free_pages_core() directly also fixes this issue.

I encountered this issue when I tried to switch some x86_64 VMs I was
running from BIOS boot to EFI boot. The x86 EFI code reserves all EFI
boot services ranges via memblock_reserve() (part of setup_arch()), and
it frees them later via memblock_free_late() (part of
efi_enter_virtual_mode()).
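
For readers who want to see the failure mode in isolation, here is a
minimal userspace sketch of the behaviour described above. It is a toy
model, not kernel code: every toy_* name is hypothetical, the deferred
init range is reduced to a single PFN threshold, and "releasing" a page
is just a counter increment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; none of these names are kernel APIs. */
struct toy_boot_state {
	unsigned long first_deferred_pfn; /* PFNs >= this are in the deferred init range */
	unsigned long managed_pages;      /* pages handed to the (toy) buddy allocator */
};

/* Assumption: the deferred range is modelled as one PFN threshold. */
bool toy_in_deferred_range(const struct toy_boot_state *s, unsigned long pfn)
{
	return pfn >= s->first_deferred_pfn;
}

/*
 * Models the problematic path: the late free goes through a gate that
 * skips pages in the deferred range, so those pages are never released.
 * Returns the number of pages actually released.
 */
unsigned long toy_free_late_gated(struct toy_boot_state *s,
				  unsigned long start_pfn,
				  unsigned long end_pfn)
{
	unsigned long released = 0;

	assert(start_pfn <= end_pfn);
	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (toy_in_deferred_range(s, pfn))
			continue; /* skipped, stays reserved forever */
		s->managed_pages++;
		released++;
	}
	return released;
}

/*
 * Models the proposed path: reserved pages are already initialized, so
 * every page in [start_pfn, end_pfn) is released unconditionally.
 */
unsigned long toy_free_late_direct(struct toy_boot_state *s,
				   unsigned long start_pfn,
				   unsigned long end_pfn)
{
	unsigned long released = 0;

	assert(start_pfn <= end_pfn);
	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++) {
		s->managed_pages++;
		released++;
	}
	return released;
}
```

With first_deferred_pfn set to 100, freeing PFNs 90 through 109 releases
only the 10 pages below the threshold in the gated variant, and all 20
pages in the direct variant.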

The EFI implementation of the VM I was attempting this on, an Amazon EC2
t3.micro instance, maps north of 170 MB in boot services ranges that
happen to fall in the deferred init range. I certainly noticed when that
much memory went missing on a 1 GB VM.

I've tested the patch on EC2 instances, qemu/KVM VMs with OVMF, and some
real x86_64 EFI systems, and they all look good to me. However, the
physical systems that I have don't actually trigger this issue because
they all have more than 4 GB of RAM, so their deferred init range starts
above 4 GB (it's always in the highest zone and ZONE_DMA32 ends at 4 GB)
while their EFI boot services mappings are below 4 GB.

Deferred struct page init can't be enabled on x86_32 so those systems
are unaffected. I haven't found any other code paths that would trigger
this issue, though I can't promise that there aren't any. I did run with
this patch on an arm64 VM as a sanity check, but memblock=debug didn't
show any calls to memblock_free_late() so that system was unaffected as
well.

I am guessing that this change should also go to the stable kernels but
it may not apply cleanly (__free_pages_core() was
__free_pages_boot_core() and memblock_free_pages() was
__free_pages_bootmem() when this issue was first introduced). I haven't
gone through that process before so please let me know if I can help
with that.

This is the end result on an EC2 t3.micro instance booting via EFI:

v6.2-rc2:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

v6.2-rc2 + patch:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816

Aaron Thompson (1):
  mm: Always release pages to the buddy allocator in
    memblock_free_late().

 mm/memblock.c                     | 8 +++++++-
 tools/testing/memblock/internal.h | 4 ++++
 2 files changed, 11 insertions(+), 1 deletion(-)

-- 
2.30.2