On Thu, Jun 06, 2024 at 07:29:22PM +0100, Matthew Wilcox wrote:
> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> is that it has a different locking context. It is called with the
> hugetlb_vma_lock held for read (nb: this is not the same as the vma
> lock; see walk_hugetlb_range()). Why do we need this? Because of page
> table sharing.

Just to quickly comment on this one: I think it's more than the per-vma
lock.

Oscar is actually working together with me (we had plenty of discussions,
but so far all off-list...), and the lock context is as simple as this
after the refactor of the hugetlb_entry() path:

https://github.com/leberus/linux/commit/88e56c1ecaf8c64ba9165aeba74335bdc15d1b56

hugetlb_entry() also exists because it is the only sane way to link to the
hugetlb API (used to be huge_pte_offset() I believe, now hugetlb_walk()),
which always walks to a specific level of the hugetlb pgtable without even
telling the caller which level that is (hence the pte_t* force-cast trick).
pxd_entry() can't apply if we don't know that info. So it's probably not
only about the locking.

Meanwhile, I have a very vague memory that the per-vma lock is also used
for something else, perhaps the fallocate() race against faults or
something. But maybe I misremember; I haven't read that part of the code
for quite some time, as our hugetlb refactoring work doesn't need that
knowledge: we simply keep all the behaviors. Maybe Muchun remembers.

Thanks,

--
Peter Xu
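
To make the force-cast point above concrete, here is a minimal userspace
sketch. It is not kernel code: every toy_* name and the simplified
pte_t/pmd_t/pud_t definitions are hypothetical stand-ins made up for
illustration. The only thing it tries to show is that once the walk hands
back a bare pte_t*, the callback can no longer tell which level it is
looking at, which is why the per-level pmd_entry/pud_entry callbacks
cannot be chosen from that pointer alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's page table entry types.
 * Assumption: they share one layout, which the cast below relies on. */
typedef struct { uint64_t val; } pte_t;
typedef struct { uint64_t val; } pmd_t;
typedef struct { uint64_t val; } pud_t;

/* Hypothetical: the level a huge page is mapped at in this toy model. */
enum toy_level { TOY_LEVEL_PMD, TOY_LEVEL_PUD };

/* Hypothetical toy mapping: one entry per level, plus the level in use. */
struct toy_hugetlb_vma {
	enum toy_level level;
	pmd_t pmd;
	pud_t pud;
};

/*
 * Mimics only the shape of a hugetlb walk: it walks to whichever level
 * the mapping uses and returns the entry as a pte_t *, so the level is
 * not visible in the return type.
 */
static pte_t *toy_hugetlb_walk(struct toy_hugetlb_vma *vma)
{
	assert(vma != NULL);
	if (vma->level == TOY_LEVEL_PUD)
		return (pte_t *)&vma->pud;	/* force-cast, level info lost */
	return (pte_t *)&vma->pmd;		/* force-cast, level info lost */
}

/*
 * A single callback that has to cope with every level, since all it
 * receives is a pte_t * with no indication of where it came from.
 */
static uint64_t toy_hugetlb_entry(pte_t *ptep)
{
	return ptep->val;
}
```

A caller would simply do toy_hugetlb_entry(toy_hugetlb_walk(&vma)) and get
the right entry for either level, without ever learning which level it was.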