Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

Yu Zhao <yuzhao@xxxxxxxxxx> · Thu, 11 Jul 2024 11:38:40 -0600

On Thu, Jul 11, 2024 at 5:39 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
>
> On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote:
> > On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > > > The Arm ARM states that we need a BBM if we change the output address
> > > > and: the old or new mappings are RW *or* the content of the page
> > > > changes. Ignoring the latter (page content), we can turn the PTEs RO
> > > > first without changing the pfn followed by changing the pfn while they
> > > > are RO. Once that's done, we make entry 0 RW and, of course, with
> > > > additional TLBIs between all these steps.
> > >
> > > Aha! This is easy to do -- I just made the RO guaranteed, as I
> > > mentioned earlier.
> > >
> > > Just to make sure I fully understand the workflow:
> > >
> > > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area.
>
> I don't think we can turn all of them RO here since some of those 512
> PTEs are not related to the hugetlb page. So you'd need to keep them RW
> but preserving the pfn so that there's no actual translation change. I
> think that's covered by FEAT_BBM level 2. Basically this step should be
> only about breaking up a PMD block entry into a table entry.

Ack.

> > > 2. TLBI once, after pmd_populate_kernel()
> > > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8
> > >    PTEs, while they remain RO.
>
> You may need some intermediate step to turn these PTEs read-only since
> step 1 should leave them RW. Also, if we want to free an order-3 page
> here, it might be better to allocate an order 0 even for PTE entry 0 (I
> had the impression that's what the core code does, I haven't checked).

Ack.

> > > 4. TLBI once, after set_pte_at() on PTE 1-7.
> > > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area.
> > > 6. TLBI once, after set_pte_at() on PTE 0.
> > >
> > > No BBM required, regardless of FEAT_BBM level 2.
> >
> > I just studied D8.16.1 from the reference manual, and it seems to me:
> > 1. We still need either FEAT_BBM or BBM to split PMD.
>
> Yes.

Also, I want to confirm my understanding of "changing table size" from
the reference manual: in our case, it means splitting a PMD into 512
PTEs with the same permission and OA. If we change the permission *or*
OA, we still need to do BBM even with FEAT_BBM level 2. Is this
correct?

> > 2. We still need BBM when we change PTE 1-7, because even if they
> > remain RO, the content of the `struct page` page at the new location
> > does not match that at the old location.
>
> Yes, in theory, the data at the new pfn should be the same. We could try
> to get clarification from the architects on what could go wrong but I
> suspect some atomicity is not guaranteed if you read the data (the
> CPU getting confused whether to read from the old or the new page).
>
> Otherwise, since after all these steps PTEs 1-7 point to the same data
> as PTE 0, before step 3 we could copy the data in page 0 over to the
> other 7 pages while entries 1-7 are still RO. The remapping afterwards
> would be fully compliant.

Correct -- we do need to copy to make it fully compliant because the
core MM doesn't guarantee that. The core MM only guarantees fields (of
struct page) required for speculative PFN walkers to function
correctly have the same value for all tail pages within a compound
page. Non-correctness related fields in theory can have different
values for those tail pages.
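
To make the ordering concrete, here is a minimal userspace sketch of one
group of 8 vmemmap PTEs. It is not kernel code: model_pte, needs_bbm(),
model_tlbi() and model_remap_group() are hypothetical names, and
needs_bbm() encodes only the rule as paraphrased earlier in this thread
(an OA change needs BBM if either mapping is RW or the page content
differs), not the full text of the reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define GROUP      8     /* vmemmap PTEs per 2MB hugetlb page (4KB pages) */
#define PAGE_BYTES 4096

/* Hypothetical model of a PTE: output address (pfn) plus one permission bit. */
struct model_pte {
	int pfn;
	bool writable;
};

static unsigned char mem[GROUP][PAGE_BYTES]; /* backing pages, indexed by pfn */
static struct model_pte ptes[GROUP];
static int tlbi_count;

static void model_tlbi(void) { tlbi_count++; }

/* State right after the PMD split: same pfns as before, all RW. */
static void model_init(void)
{
	for (int i = 0; i < GROUP; i++) {
		memset(mem[i], i + 1, PAGE_BYTES); /* each page differs */
		ptes[i] = (struct model_pte){ .pfn = i, .writable = true };
	}
	tlbi_count = 0;
}

/* Assumption: the rule as paraphrased in this thread, nothing more. */
static bool needs_bbm(struct model_pte old, struct model_pte next)
{
	if (old.pfn == next.pfn)
		return false;              /* permission-only change */
	if (old.writable || next.writable)
		return true;               /* OA change with an RW mapping */
	return memcmp(mem[old.pfn], mem[next.pfn], PAGE_BYTES) != 0;
}

/* Returns how many PTE updates would have required BBM (0 = compliant). */
static int model_remap_group(bool copy_first)
{
	int violations = 0;

	/* Turn the 8 PTEs RO without changing their pfns. */
	for (int i = 0; i < GROUP; i++)
		ptes[i].writable = false;
	model_tlbi();

	/* Optionally make pages 1-7 identical to page 0 before remapping. */
	if (copy_first)
		for (int i = 1; i < GROUP; i++)
			memcpy(mem[i], mem[0], PAGE_BYTES);

	/* Remap PTEs 1-7 to the page of PTE 0 while everything is RO. */
	for (int i = 1; i < GROUP; i++) {
		struct model_pte next = { .pfn = ptes[0].pfn, .writable = false };

		violations += needs_bbm(ptes[i], next);
		ptes[i] = next;
	}
	model_tlbi();

	/* Finally make PTE 0 RW again. */
	ptes[0].writable = true;
	model_tlbi();
	return violations;
}
```

In this model, calling model_remap_group(false) after model_init()
flags all seven remaps, while model_remap_group(true) flags none, which
is the point of copying page 0 over pages 1-7 first.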

> > > > Can we leave entry 0 RO? This would save an additional TLBI.
> > >
> > > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
> > > refcnt on any hugeTLB pages.
>
> OK, fair enough.
>
> > > > Now, I wonder if all this is worth it. What are the scenarios where the
> > > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> > > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned.
> >
> > One of the fundamental assumptions in core MM is that anyone can
> > read or try to grab (write) a refcnt from any `struct page`. Those
> > speculative PFN walkers include memory compaction, etc.
>
> But how does this work if PTEs 1-7 are RO? Do those walkers detect it's
> a tail page and skip it?

Correct.

> Actually, if they all point to the same vmemmap
> page, how can one distinguish a tail page via PTE 1 from the head page
> via PTE 0?

Two of the correctness related fields are page->_refcount and
page->compound_head:
1. _refcount is the only one that can be speculatively updated.
Speculative walkers are not allowed to update other fields unless they
can grab a refcount. All tail pages must have zero refcount.
2. compound_head speculatively indicates whether a page is head or
tail, and if it's tail, its head can be extracted by compound_head().
Since a head can have non-zero refcount, after PTEs 1-7 are remapped
to PTE 0, we need a way to prevent speculative walkers from mistaking
the first tail page under each of PTEs 1-7 for the head and trying to
grab its refcount. This is done by page_is_fake_head() returning true,
which relies on the following sequence on the writer side:
2a. init compound_head
2b. reset _refcount to 0
2c. synchronize_rcu()
2d. remap PTEs 1-7 to PTE 0
2e. inc _refcount
Speculative readers of the first tail under each of PTEs 1-7 see
either a refcount of 0 or page_is_fake_head() returning true.
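
For illustration, a simplified userspace model of the reader-side check
and the writer-side sequence. Everything here is an assumption made for
the sketch: model_page, vpage(), model_is_fake_head(), model_try_get()
and model_writer() are hypothetical names, struct page is taken to be 64
bytes with 4KB pages, indices stand in for addresses, and
synchronize_rcu() is only marked by a comment. The real
page_is_fake_head() works on `struct page` pointers and differs in
detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS_PER_PAGE   64  /* assumes 4KB pages and a 64-byte struct page */
#define NR_VMEMMAP_PAGES 8   /* vmemmap pages backing one 2MB hugetlb page */

/* Hypothetical stand-in for the correctness-related fields of struct page. */
struct model_page {
	bool head_flag;              /* stands in for PG_head */
	unsigned long compound_head; /* (head index << 1) | 1 for tails, else 0 */
	int refcount;                /* stands in for _refcount */
};

static struct model_page phys[NR_VMEMMAP_PAGES][SLOTS_PER_PAGE];
static int pte[NR_VMEMMAP_PAGES]; /* virtual vmemmap page -> physical page */

/* Look up a struct page through the (possibly remapped) vmemmap. */
static struct model_page *vpage(size_t vidx)
{
	return &phys[pte[vidx / SLOTS_PER_PAGE]][vidx % SLOTS_PER_PAGE];
}

/*
 * A page-aligned entry that looks like a head is fake if the entry after
 * it names a different virtual index as its head.
 */
static bool model_is_fake_head(size_t vidx)
{
	unsigned long head;

	if (vidx % SLOTS_PER_PAGE != 0 || !vpage(vidx)->head_flag)
		return false;
	head = vpage(vidx + 1)->compound_head;
	return (head & 1) && (head >> 1) != vidx;
}

/* Speculative reader: back off on zero refcount or on a fake head. */
static bool model_try_get(size_t vidx)
{
	if (vpage(vidx)->refcount == 0 || model_is_fake_head(vidx))
		return false;
	vpage(vidx)->refcount++;
	return true;
}

/* Writer side, following steps 2a-2e above. */
static void model_writer(void)
{
	size_t i;

	for (i = 0; i < NR_VMEMMAP_PAGES; i++)
		pte[i] = (int)i;                /* identity mapping to start */

	for (i = 0; i < NR_VMEMMAP_PAGES * SLOTS_PER_PAGE; i++) {
		vpage(i)->head_flag = (i == 0);
		vpage(i)->compound_head = i ? 1UL : 0UL; /* 2a: head is index 0 */
		vpage(i)->refcount = 0;                  /* 2b */
	}
	/* 2c: synchronize_rcu() would go here in the kernel. */
	for (i = 1; i < NR_VMEMMAP_PAGES; i++)
		pte[i] = 0;                     /* 2d: remap PTEs 1-7 to PTE 0 */
	vpage(0)->refcount++;                   /* 2e */
}
```

After model_writer(), index 64 (the first entry under PTE 1) aliases the
head's data but model_is_fake_head(64) is true, so model_try_get(64)
backs off, while model_try_get(0) on the real head succeeds.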

> BTW, I'll be on holiday from tomorrow for two weeks and won't be able to
> follow up on this thread (and likely to forget all the discussion by the
> time I get back ;)).

Thanks for the heads-up!