Jann Horn <jannh@xxxxxxxxxx> writes:

> On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@xxxxxxxxxx> wrote:
>>
>> Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
>>
>> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
>> >> """
>> >> This guarantees that the page tables that are being walked
>> >> aren't freed concurrently, but at the end of the walk, we
>> >> have to grab a stable reference to the referenced page.
>> >> For this we use the grab-reference-and-revalidate trick
>> >> from above again: First we (locklessly) load the page
>> >> table entry, then we grab a reference to the page that it
>> >> points to (which can fail if the refcount is zero, in that
>> >> case we bail), then we recheck that the page table entry
>> >> is still the same, and if it changed in between, we drop
>> >> the page reference and bail.
>> >> This can, again, grab a reference to a page after it has
>> >> already been freed and reallocated. The reason why this is
>> >> fine is that the metadata structure that holds this
>> >> refcount, `struct folio` (or `struct page`, depending on
>> >> which kernel version you're looking at; in current kernels
>> >> it's `folio`, but `struct page` and `struct folio` are
>> >> actually aliases for the same memory, basically, though
>> >> that is supposed to maybe change at some point), is never
>> >> freed; even when a page is freed and reallocated, the
>> >> corresponding `struct folio` stays. This does have the fun
>> >> consequence that whenever a page/folio has a non-zero
>> >> refcount, the refcount can spuriously go up and then back
>> >> down for a little bit.
>> >> (Also it's technically not as simple as I just described
>> >> it, because the `struct page` that the PTE points to might
>> >> be a "tail page" of a `struct folio`. So actually we first
>> >> read the PTE, the PTE gives us the `page*`, then from that
>> >> we go to the `folio*`, then we try to grab a reference to
>> >> the `folio`, then if that worked we check that the `page`
>> >> still points to the same `folio`, and then we recheck that
>> >> the PTE is still the same.)
>> >> """
>> >
>> > Nngh. In trying to make this description fit all kernels (with
>> > both pages and folios), you've complicated it maximally. Let's
>> > try a more simple explanation:
>> >
>> > First we (locklessly) load the page table entry, then we grab a
>> > reference to the folio that contains it (which can fail if the
>> > refcount is zero, in that case we bail), then we recheck that the
>> > page table entry is still the same, and if it changed in between,
>> > we drop the folio reference and bail.
>> > This can, again, grab a reference to a folio after it has
>> > already been freed and reallocated. The reason why this is
>> > fine is that the metadata structure that holds this refcount,
>> > `struct folio`, is never freed; even when a folio is
>> > freed and reallocated, the corresponding `struct folio`
>> > stays.
>
> Oh, thanks. You're right, trying to talk about kernels with folios
> made it unnecessarily complicated...
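(For illustration only: the sequence Willy describes corresponds
roughly to the sketch below. It is modelled on what gup_pte_range()
and folio_try_get() do, but the function name is made up, and the
real code also handles pte_special(), write permission, tail-page
rechecks and so on. Runs with IRQs disabled.)

	static struct folio *grab_and_revalidate(pte_t *ptep)
	{
		/* Locklessly load the page table entry. */
		pte_t pte = ptep_get_lockless(ptep);
		struct folio *folio = page_folio(pte_page(pte));

		/* Fails if the folio's refcount is zero - bail. */
		if (!folio_try_get(folio))
			return NULL;

		/* Recheck the PTE; if it changed in between, drop the
		 * reference and bail. */
		if (!pte_same(ptep_get_lockless(ptep), pte)) {
			folio_put(folio);
			return NULL;
		}

		return folio;	/* stable reference */
	}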
>> I'm probably missing something obvious, but how is that synchronised
>> against memory hotplug? AFAICT if it isn't, couldn't the pages be
>> freed and the memory removed? In that case the above would no longer
>> hold because (I think) the metadata structure could have been freed.
>
> Hm... that's this codepath?
>
> arch_remove_memory -> __remove_pages -> __remove_section ->
> sparse_remove_section -> section_deactivate ->
> depopulate_section_memmap -> vmemmap_free -> remove_pagetable
>
> which then walks down the page tables and ends up freeing individual
> pages in remove_pte_table() using the confusingly-named
> free_pagetable()?

Right. section_deactivate() will also clear SECTION_HAS_MEM_MAP, which
would trigger VM_BUG_ON(!pfn_valid(pte_pfn(pte))) in gup_pte_range().

> I'm not sure what the synchronization against hotplug is - GUP-fast is
> running with IRQs disabled, but other codepaths might not, like
> get_ksm_page()? I don't know if that's holding something else for
> protection...

I was thinking about this from the ZONE_DEVICE perspective (i.e.
memunmap_pages -> pageunmap_range -> arch_remove_memory -> ...). That
runs with IRQs enabled, and I couldn't see any other synchronization.
pageunmap_range() does call mem_hotplug_begin(), which takes the
hotplug locks, but GUP-fast doesn't take those locks.

So based on Peter's response I think I need to add a synchronize_rcu()
call to pageunmap_range() right after calling mem_hotplug_begin(). I
could just add it to mem_hotplug_begin(), but offline_pages() calls
that too early (before the pages have been isolated), so that will
need a separate change.
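For concreteness, something like the below (untested sketch, not a
real patch). It relies on GUP-fast running with interrupts disabled,
which acts as an RCU read-side critical section that synchronize_rcu()
is guaranteed to wait for:

--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 	mem_hotplug_begin();
+	/*
+	 * GUP-fast walks page tables with IRQs disabled; wait for any
+	 * in-flight walkers before the pages (and their struct pages)
+	 * are torn down.
+	 */
+	synchronize_rcu();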