On 2/17/25 21:45, Andrew Morton wrote:
> When a process leads to the addition of a struct page to vmemmap
> (e.g. hot-plug), the page table update for the newly added vmemmap-based
> virtual address is updated first in init_mm's page table and then
> synchronized later.
> If the vmemmap-based virtual address is accessed through the process's
> page table before this sync, a page fault will occur.

So, I think we're talking about the loop in vmemmap_populate_hugepages()
(with a bunch of context chopped out):

	for (addr = start; addr < end; addr = next) {
		...
		pgd = vmemmap_pgd_populate(addr, node);
		if (!pgd)
			return -ENOMEM;
		...
		vmemmap_set_pmd(pmd, p, node, addr, next);
	}

This both creates a mapping under 'pgd' and uses the new mapping inside
vmemmap_set_pmd().

This is generally a known problem since vmemmap_populate() already does
a sync_global_pgds(). The reason it manifests here is that the
vmemmap_set_pmd() comes before the sync:

	vmemmap_populate() {
		vmemmap_populate_hugepages() {
			vmemmap_pgd_populate(addr, node);
			...
			// crash:
			vmemmap_set_pmd(pmd, p, node, addr, next);
		}
		// too late:
		sync_global_pgds();
	}

I really don't like the idea of having the x86 code just be super
careful not to use the newly-populated PGD (this patch). That's fragile
and further diverges the x86 implementation from the generic code.

The quick and dirty fix would be just to call sync_global_pgds() all
the time, like:

	pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
	{
		pgd_t *pgd = pgd_offset_k(addr);
		if (pgd_none(*pgd)) {
			void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
			if (!p)
				return NULL;
			pgd_populate(&init_mm, pgd, p);
	+		sync_global_pgds(...);
		}
		return pgd;
	}

That actually mirrors how __kernel_physical_mapping_init() does it:
watch for an actual PGD write and sync there. It shouldn't be too slow
because it only calls sync_global_pgds() during actual PGD population,
which is horribly rare.

Could we do something like that, please?
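For reference, the pattern in __kernel_physical_mapping_init() looks
roughly like this (heavily simplified sketch, details elided; see
arch/x86/mm/init_64.c for the real thing):

	bool pgd_changed = false;

	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
		pgd_t *pgd = pgd_offset_k(vaddr);
		...
		if (pgd_none(*pgd)) {
			/* allocate and hook up the lower-level table */
			...
			pgd_changed = true;
		}
		...
	}

	/* only pay for the sync when a PGD entry was actually written: */
	if (pgd_changed)
		sync_global_pgds(vaddr_start, vaddr_end - 1);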
It might mean defining a new __weak symbol in mm/sparse-vmemmap.c and
then calling out to an x86 implementation like vmemmap_set_pmd().

Is x86 just an oddball with how it populates kernel page tables? I'm a
bit surprised nobody else has this problem too.
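To illustrate the __weak idea (the hook name and its range arguments
here are made up, purely a sketch):

	/* mm/sparse-vmemmap.c -- generic default is a no-op: */
	void __weak __meminit vmemmap_sync_pgd(unsigned long start,
					       unsigned long end)
	{
	}

	pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
	{
		pgd_t *pgd = pgd_offset_k(addr);
		if (pgd_none(*pgd)) {
			void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
			if (!p)
				return NULL;
			pgd_populate(&init_mm, pgd, p);
			/* sync immediately, before anything uses the mapping */
			vmemmap_sync_pgd(addr, addr + PGDIR_SIZE);
		}
		return pgd;
	}

	/* arch/x86/mm/init_64.c -- x86 override: */
	void __meminit vmemmap_sync_pgd(unsigned long start, unsigned long end)
	{
		sync_global_pgds(start, end - 1);
	}

Architectures that don't need the sync just get the empty default.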