Re: + x86-vmemmap-use-direct-mapped-va-instead-of-vmemmap-based-va.patch added to mm-hotfixes-unstable branch

On 2/18/25 8:41 PM, Dave Hansen wrote:
> On 2/17/25 21:45, Andrew Morton wrote:
>> When a process triggers the addition of a struct page to the vmemmap
>> (e.g. memory hot-plug), the page table entry for the newly added
>> vmemmap-based virtual address is created in init_mm's page table
>> first and only synchronized to other page tables later.
>> If the vmemmap-based virtual address is accessed through a process's
>> page table before this sync, a page fault will occur.
> 
> So, I think we're talking about the loop in vmemmap_populate_hugepages()
> (with a bunch of context chopped out):
> 
>          for (addr = start; addr < end; addr = next) {
>                  ...
>                  pgd = vmemmap_pgd_populate(addr, node);
>                  if (!pgd)
>                          return -ENOMEM;
>                  ...
>                  vmemmap_set_pmd(pmd, p, node, addr, next);
>          }
> 
> This both creates a mapping under 'pgd' and uses the new mapping inside
> vmemmap_set_pmd(). This is generally a known problem, since
> vmemmap_populate() already does a sync_global_pgds(). The reason it
> manifests here is that the vmemmap_set_pmd() comes before the sync:
> 
> vmemmap_populate() {
> 	vmemmap_populate_hugepages() {
> 		vmemmap_pgd_populate(addr, node);
> 		...
> 		// crash:
> 		vmemmap_set_pmd(pmd, p, node, addr, next);
> 	}
> 	// too late:
> 	sync_global_pgds();
> }
> 
> I really don't like the idea of having the x86 code just be super
> careful not to use the newly-populated PGD (this patch). That's fragile
> and further diverges the x86 implementation from the generic code.
> 
> The quick and dirty fix would be just to call sync_global_pgds() all
> the time, like:
> 
> pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
> {
>          pgd_t *pgd = pgd_offset_k(addr);
>          if (pgd_none(*pgd)) {
>                  void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>                  if (!p)
>                          return NULL;
>                  pgd_populate(&init_mm, pgd, p);
> +		sync_global_pgds(...);
>          }
>          return pgd;
> }
> 
> That actually mirrors how __kernel_physical_mapping_init() does it:
> watch for an actual PGD write and sync there. It shouldn't be too slow,
> because it only calls sync_global_pgds() during actual PGD population,
> which is horribly rare.
> 
> Could we do something like that, please? It might mean defining a new
> __weak symbol in mm/sparse-vmemmap.c and then calling out to an x86
> implementation like vmemmap_set_pmd().

Hi Dave Hansen,

Thanks for the feedback. I agree that generic code is preferable to a
divergent x86-only implementation.
If everyone else agrees, I'll send a new patch in this style.
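
Concretely, something like this? (Just a rough sketch to check that I've
understood the suggestion; arch_sync_kernel_pgd() is a placeholder name
I made up.)

/* mm/sparse-vmemmap.c: generic default, a no-op for most architectures */
void __weak __meminit arch_sync_kernel_pgd(unsigned long addr)
{
}

pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
{
	pgd_t *pgd = pgd_offset_k(addr);

	if (pgd_none(*pgd)) {
		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);

		if (!p)
			return NULL;
		pgd_populate(&init_mm, pgd, p);
		/* Sync before anything tries to walk the new entry. */
		arch_sync_kernel_pgd(addr);
	}
	return pgd;
}

/* arch/x86/mm/init_64.c: x86 override */
void __meminit arch_sync_kernel_pgd(unsigned long addr)
{
	/* sync_global_pgds() takes an inclusive address range. */
	sync_global_pgds(addr, addr);
}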

> Is x86 just an oddball with how it populates kernel page tables? I'm a
> bit surprised nobody else has this problem too.

Presumably this won't happen on other architectures, because they don't
hit the scenario of accessing the new address through an unsynchronized
page table before the PGD sync.
Please correct me if I'm wrong.
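
For reference, my mental model of why x86-64 needs the sync at all, as a
simplified sketch of the idea behind sync_global_pgds() (illustrative
only, not the actual code; locking is omitted): each process page table
carries its own copy of the kernel PGD entries, so an entry newly set in
init_mm stays invisible to other page tables until it is copied into
each of them.

static void sync_global_pgds_sketch(unsigned long start, unsigned long end)
{
	unsigned long addr;

	/* 'end' is inclusive, matching the real callers. */
	for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
		pgd_t *pgd_ref = pgd_offset_k(addr);
		struct page *page;

		/* Nothing to propagate for this range. */
		if (pgd_none(*pgd_ref))
			continue;

		/* Copy the init_mm entry into every process's PGD. */
		list_for_each_entry(page, &pgd_list, lru) {
			pgd_t *pgd = (pgd_t *)page_address(page) +
				     pgd_index(addr);

			if (pgd_none(*pgd))
				set_pgd(pgd, *pgd_ref);
		}
	}
}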

Br,
G.G.



