On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@xxxxxxxxxx> wrote:
>
>
> Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
>
> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> >> """
> >> This guarantees that the page tables that are being walked
> >> aren't freed concurrently, but at the end of the walk, we
> >> have to grab a stable reference to the referenced page.
> >> For this we use the grab-reference-and-revalidate trick
> >> from above again:
> >> First we (locklessly) load the page
> >> table entry, then we grab a reference to the page that it
> >> points to (which can fail if the refcount is zero, in that
> >> case we bail), then we recheck that the page table entry
> >> is still the same, and if it changed in between, we drop the
> >> page reference and bail.
> >> This can, again, grab a reference to a page after it has
> >> already been freed and reallocated. The reason why this is
> >> fine is that the metadata structure that holds this refcount,
> >> `struct folio` (or `struct page`, depending on which kernel
> >> version you're looking at; in current kernels it's `folio`
> >> but `struct page` and `struct folio` are actually aliases for
> >> the same memory, basically, though that is supposed to maybe
> >> change at some point) is never freed; even when a page is
> >> freed and reallocated, the corresponding `struct folio`
> >> stays. This does have the fun consequence that whenever a
> >> page/folio has a non-zero refcount, the refcount can
> >> spuriously go up and then back down for a little bit.
> >> (Also it's technically not as simple as I just described it,
> >> because the `struct page` that the PTE points to might be
> >> a "tail page" of a `struct folio`.
> >> So actually we first read the PTE, the PTE gives us the
> >> `page*`, then from that we go to the `folio*`, then we
> >> try to grab a reference to the `folio`, then if that worked
> >> we check that the `page` still points to the same `folio`,
> >> and then we recheck that the PTE is still the same.)
> >> """
> >
> > Nngh. In trying to make this description fit all kernels (with
> > both pages and folios), you've complicated it maximally. Let's
> > try a more simple explanation:
> >
> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.

Oh, thanks. You're right, trying to talk about kernels with folios
made it unnecessarily complicated...
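To make that pattern concrete, here's a rough userspace-style sketch
of the grab-reference-and-revalidate trick using C11 atomics. All the
names below are made up for illustration; this is not the actual
gup.c code (which additionally has to deal with tail pages, pinning
vs. plain references, and so on):

/*
 * Userspace analogue of the "grab reference and revalidate" pattern
 * described above. All names here are invented for illustration.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_folio {
        atomic_int refcount;    /* 0 means "freed, do not touch" */
};

/* Analogue of a PTE slot that another thread may change concurrently. */
struct fake_pte {
        _Atomic(struct fake_folio *) folio;
};

/*
 * get_page_unless_zero()-style helper: only take a reference if the
 * refcount is not already zero.
 */
static bool folio_try_get_ref(struct fake_folio *folio)
{
        int old = atomic_load(&folio->refcount);

        do {
                if (old == 0)
                        return false;
        } while (!atomic_compare_exchange_weak(&folio->refcount, &old,
                                               old + 1));
        return true;
}

static void folio_put_ref(struct fake_folio *folio)
{
        atomic_fetch_sub(&folio->refcount, 1);
}

/*
 * Lockless lookup: load the entry, take a reference, then recheck that
 * the entry still points at the same folio. If it changed in between,
 * drop the (possibly spurious) reference and tell the caller to bail.
 */
static struct fake_folio *lockless_grab(struct fake_pte *pte)
{
        struct fake_folio *folio = atomic_load(&pte->folio);

        if (!folio || !folio_try_get_ref(folio))
                return NULL;    /* no mapping, or refcount was zero: bail */

        if (atomic_load(&pte->folio) != folio) {
                /* Entry changed under us; our reference may be on a
                 * folio that has been reused, so drop it and bail. */
                folio_put_ref(folio);
                return NULL;
        }
        return folio;           /* stable reference */
}

The point being that the increment in folio_try_get_ref() may hit a
folio that has already been reused for something else; that's harmless
only because the recheck afterwards drops the reference again, and
because the refcount metadata itself is never freed.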
> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Hm... that's this codepath?
arch_remove_memory -> __remove_pages -> __remove_section ->
sparse_remove_section -> section_deactivate ->
depopulate_section_memmap -> vmemmap_free -> remove_pagetable
which then walks down the page tables and ends up freeing individual
pages in remove_pte_table() using the confusingly-named
free_pagetable()?

I'm not sure what the synchronization against hotplug is - GUP-fast is
running with IRQs disabled, but other codepaths might not, like
get_ksm_page()? I don't know if that's holding something else for
protection...
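For reference, the IRQ-disabled window mentioned above is the one
wrapped around the lockless walk in mm/gup.c; very roughly, with
version-specific details and the retry/unpin handling left out, it is
shaped like this (a sketch, not the literal code):

/*
 * Heavily abbreviated sketch of the lockless GUP path, not the literal
 * mm/gup.c code (names and details differ between kernel versions).
 * The whole lockless page table walk runs with local interrupts
 * disabled; the normal page-table-freeing side synchronizes against
 * that window, either via the TLB flush IPI broadcast or the RCU-style
 * table freeing, before the tables are actually freed.
 */
static unsigned long lockless_pages_from_mm(unsigned long start,
                                            unsigned long end,
                                            unsigned int gup_flags,
                                            struct page **pages)
{
        unsigned long flags;
        int nr_pinned = 0;

        local_irq_save(flags);
        gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
        local_irq_restore(flags);

        return nr_pinned;
}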