On 2/17/25 21:45, Andrew Morton wrote:
> When a process leads to the addition of a struct page to vmemmap
> (e.g. hot-plug), the page table update for the newly added vmemmap-based
> virtual address is updated first in init_mm's page table and then
> synchronized later.
> If the vmemmap-based virtual address is accessed through the process's
> page table before this sync, a page fault will occur.

So, I think we're talking about the loop in vmemmap_populate_hugepages()
(with a bunch of context chopped out):

	for (addr = start; addr < end; addr = next) {
		...
		pgd = vmemmap_pgd_populate(addr, node);
		if (!pgd)
			return -ENOMEM;
		...
		vmemmap_set_pmd(pmd, p, node, addr, next);
	}

This both creates a mapping under 'pgd' and uses the new mapping inside
vmemmap_set_pmd().

This is generally a known problem since vmemmap_populate() already does
a sync_global_pgds(). The reason it manifests here is that the
vmemmap_set_pmd() comes before the sync:

	vmemmap_populate() {
		vmemmap_populate_hugepages() {
			vmemmap_pgd_populate(addr, node);
			...
			// crash:
			vmemmap_set_pmd(pmd, p, node, addr, next);
		}
		// too late:
		sync_global_pgds();
	}

I really don't like the idea of having the x86 code just be super
careful not to use the newly-populated PGD (this patch). That's fragile
and further diverges the x86 implementation from the generic code.

The quick and dirty fix would be just to call sync_global_pgds() all
the time, like:

	pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
	{
		pgd_t *pgd = pgd_offset_k(addr);
		if (pgd_none(*pgd)) {
			void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
			if (!p)
				return NULL;
			pgd_populate(&init_mm, pgd, p);
	+		sync_global_pgds(...);
		}
		return pgd;
	}

That actually mirrors how __kernel_physical_mapping_init() does it:
watch for an actual PGD write and sync there. It shouldn't be too slow
because it only calls sync_global_pgds() during actual PGD population,
which is horribly rare.

Could we do something like that, please?
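For reference, the pattern in __kernel_physical_mapping_init() looks
roughly like this (heavily simplified sketch, details elided; see
arch/x86/mm/init_64.c for the real thing):

	bool pgd_changed = false;

	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
		pgd_t *pgd = pgd_offset_k(vaddr);
		...
		if (pgd_none(*pgd)) {
			/* allocate and hook up the lower-level table */
			...
			pgd_changed = true;
		}
		...
	}

	/* only pay for the sync when a PGD entry was actually written: */
	if (pgd_changed)
		sync_global_pgds(vaddr_start, vaddr_end - 1);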
It might mean defining a new __weak symbol in mm/sparse-vmemmap.c and
then calling out to an x86 implementation like vmemmap_set_pmd().

Is x86 just an oddball with how it populates kernel page tables? I'm a
bit surprised nobody else has this problem too.
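To illustrate the __weak idea (the hook name and its range arguments
here are made up, purely a sketch):

	/* mm/sparse-vmemmap.c -- generic default is a no-op: */
	void __weak __meminit vmemmap_sync_pgd(unsigned long start,
					       unsigned long end)
	{
	}

	pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
	{
		pgd_t *pgd = pgd_offset_k(addr);
		if (pgd_none(*pgd)) {
			void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
			if (!p)
				return NULL;
			pgd_populate(&init_mm, pgd, p);
			/* sync immediately, before anything uses the mapping */
			vmemmap_sync_pgd(addr, addr + PGDIR_SIZE);
		}
		return pgd;
	}

	/* arch/x86/mm/init_64.c -- x86 override: */
	void __meminit vmemmap_sync_pgd(unsigned long start, unsigned long end)
	{
		sync_global_pgds(start, end - 1);
	}

Architectures that don't need the sync just get the empty default.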