Re: Unifying page table walkers

Peter Xu <peterx@xxxxxxxxxx> · Thu, 6 Jun 2024 17:49:30 -0400

On Thu, Jun 06, 2024 at 07:29:22PM +0100, Matthew Wilcox wrote:
> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> is that it has a different locking context.  It is called with the
> hugetlb_vma_lock held for read (nb: this is not the same as the vma
> lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
> table sharing.

Just a quick comment on this one: I think it's about more than the per-vma
lock.  Oscar and I are actually working on this together (we've had plenty
of discussions, but so far all off-list...), and after the refactor the
locking context for the hugetlb_entry() path becomes as simple as this:

https://github.com/leberus/linux/commit/88e56c1ecaf8c64ba9165aeba74335bdc15d1b56

hugetlb_entry() also exists because it's the only sane way to link to the
hugetlb API (it used to be huge_pte_offset() I believe, now
hugetlb_walk()), which always walks to a specific level of the hugetlb
pgtable without even telling the caller which level that is (hence the
pte_t* force-cast trick).  The pxd_entry() hooks can't apply if we don't
know that info.  So it's probably not only about the locking.
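
To illustrate the force-cast point, here is a toy userspace sketch, not
kernel code.  All toy_* names, the sizes and the flat "pgtable" struct are
made up for illustration; the only thing borrowed from the real API is the
shape of the problem: the walk returns a pte_t* whatever level it stopped
at, so the caller has to re-derive the level from the huge page size.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; NOT the real definitions. */
typedef struct { unsigned long val; } pte_t;
typedef struct { unsigned long val; } pmd_t;
typedef struct { unsigned long val; } pud_t;

/* Illustrative huge page sizes (assumed, x86-64-like). */
#define TOY_PMD_SIZE (1UL << 21)	/* 2 MiB */
#define TOY_PUD_SIZE (1UL << 30)	/* 1 GiB */

enum toy_level { TOY_LEVEL_PMD, TOY_LEVEL_PUD };

/* A "page table" with exactly one entry per level, to keep it short. */
struct toy_pgtable {
	pud_t pud;
	pmd_t pmd;
};

/*
 * Hypothetical model of a hugetlb-style walk: it stops at whichever
 * level matches sz, but the result is always typed as pte_t *, so the
 * return type carries no level information.
 */
static pte_t *toy_hugetlb_walk(struct toy_pgtable *pt, unsigned long sz)
{
	if (sz == TOY_PUD_SIZE)
		return (pte_t *)&pt->pud;	/* really a pud_t * */
	if (sz == TOY_PMD_SIZE)
		return (pte_t *)&pt->pmd;	/* really a pmd_t * */
	return NULL;				/* unsupported size */
}

/*
 * The caller must recover the level from the size it passed in; the
 * pointer alone cannot say whether a pmd- or pud-style hook applies.
 */
static enum toy_level toy_level_of(unsigned long sz)
{
	assert(sz == TOY_PMD_SIZE || sz == TOY_PUD_SIZE);
	return sz == TOY_PUD_SIZE ? TOY_LEVEL_PUD : TOY_LEVEL_PMD;
}
```

A generic walker that wanted to dispatch to per-level hooks would need
something like toy_level_of() on every call, which is the piece of
information the pte_t*-returning interface does not hand back.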

Meanwhile, I have a very vague memory that the per-vma lock is also used
for something else, perhaps for fallocate() racing against faults or
something.  But maybe I misremember; I haven't read that part of the code
for quite some time, as our hugetlb refactoring work doesn't require that
knowledge: we simply keep all the existing behaviors.  Maybe Muchun
remembers.

Thanks,

-- 
Peter Xu