Jann Horn <jannh@xxxxxxxxxx> writes:

> On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@xxxxxxxxxx> wrote:
>>
>> Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
>>
>> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
>> >> """
>> >> This guarantees that the page tables that are being walked
>> >> aren't freed concurrently, but at the end of the walk, we
>> >> have to grab a stable reference to the referenced page.
>> >> For this we use the grab-reference-and-revalidate trick
>> >> from above again: First we (locklessly) load the page
>> >> table entry, then we grab a reference to the page that it
>> >> points to (which can fail if the refcount is zero, in that
>> >> case we bail), then we recheck that the page table entry
>> >> is still the same, and if it changed in between, we drop
>> >> the page reference and bail.
>> >> This can, again, grab a reference to a page after it has
>> >> already been freed and reallocated. The reason why this is
>> >> fine is that the metadata structure that holds this
>> >> refcount, `struct folio` (or `struct page`, depending on
>> >> which kernel version you're looking at; in current kernels
>> >> it's `folio`, but `struct page` and `struct folio` are
>> >> actually aliases for the same memory, basically, though
>> >> that is supposed to maybe change at some point), is never
>> >> freed; even when a page is freed and reallocated, the
>> >> corresponding `struct folio` stays. This does have the fun
>> >> consequence that whenever a page/folio has a non-zero
>> >> refcount, the refcount can spuriously go up and then back
>> >> down for a little bit.
>> >> (Also it's technically not as simple as I just described
>> >> it, because the `struct page` that the PTE points to might
>> >> be a "tail page" of a `struct folio`. So actually we first
>> >> read the PTE, the PTE gives us the `page*`, then from that
>> >> we go to the `folio*`, then we try to grab a reference to
>> >> the `folio`, then if that worked we check that the `page`
>> >> still points to the same `folio`, and then we recheck that
>> >> the PTE is still the same.)
>> >> """
>> >
>> > Nngh. In trying to make this description fit all kernels (with
>> > both pages and folios), you've complicated it maximally. Let's
>> > try a more simple explanation:
>> >
>> > First we (locklessly) load the page table entry, then we grab a
>> > reference to the folio that contains it (which can fail if the
>> > refcount is zero, in that case we bail), then we recheck that the
>> > page table entry is still the same, and if it changed in between,
>> > we drop the folio reference and bail.
>> > This can, again, grab a reference to a folio after it has
>> > already been freed and reallocated. The reason why this is
>> > fine is that the metadata structure that holds this refcount,
>> > `struct folio`, is never freed; even when a folio is
>> > freed and reallocated, the corresponding `struct folio`
>> > stays.
>
> Oh, thanks. You're right, trying to talk about kernels with folios
> made it unnecessarily complicated...
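(For illustration only: the sequence Willy describes corresponds
roughly to the sketch below. It is modelled on what gup_pte_range()
and folio_try_get() do, but the function name is made up, and the
real code also handles pte_special(), write permission, tail-page
rechecks and so on. Runs with IRQs disabled.)

	static struct folio *grab_and_revalidate(pte_t *ptep)
	{
		/* Locklessly load the page table entry. */
		pte_t pte = ptep_get_lockless(ptep);
		struct folio *folio = page_folio(pte_page(pte));

		/* Fails if the folio's refcount is zero - bail. */
		if (!folio_try_get(folio))
			return NULL;

		/* Recheck the PTE; if it changed in between, drop the
		 * reference and bail. */
		if (!pte_same(ptep_get_lockless(ptep), pte)) {
			folio_put(folio);
			return NULL;
		}

		return folio;	/* stable reference */
	}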
>> I'm probably missing something obvious, but how is that synchronised
>> against memory hotplug? AFAICT if it isn't, couldn't the pages be
>> freed and the memory removed? In that case the above would no longer
>> hold because (I think) the metadata structure could have been freed.
>
> Hm... that's this codepath?
>
> arch_remove_memory -> __remove_pages -> __remove_section ->
> sparse_remove_section -> section_deactivate ->
> depopulate_section_memmap -> vmemmap_free -> remove_pagetable
>
> which then walks down the page tables and ends up freeing individual
> pages in remove_pte_table() using the confusingly-named
> free_pagetable()?

Right. section_deactivate() will also clear SECTION_HAS_MEM_MAP, which
would trigger VM_BUG_ON(!pfn_valid(pte_pfn(pte))) in gup_pte_range().

> I'm not sure what the synchronization against hotplug is - GUP-fast is
> running with IRQs disabled, but other codepaths might not, like
> get_ksm_page()? I don't know if that's holding something else for
> protection...

I was thinking about this from the ZONE_DEVICE perspective (i.e.
memunmap_pages -> pageunmap_range -> arch_remove_memory -> ...). That
runs with IRQs enabled, and I couldn't see any other synchronization.
pageunmap_range() does call mem_hotplug_begin(), which takes the
hotplug locks, but GUP-fast doesn't take those locks.

So based on Peter's response I think I need to add a synchronize_rcu()
call to pageunmap_range() right after calling mem_hotplug_begin(). I
could just add it to mem_hotplug_begin(), but offline_pages() calls
that too early (before the pages have been isolated), so that will
need a separate change.
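For concreteness, something like the below (untested sketch, not a
real patch). It relies on GUP-fast running with interrupts disabled,
which acts as an RCU read-side critical section that synchronize_rcu()
is guaranteed to wait for:

--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 	mem_hotplug_begin();
+	/*
+	 * GUP-fast walks page tables with IRQs disabled; wait for any
+	 * in-flight walkers before the pages (and their struct pages)
+	 * are torn down.
+	 */
+	synchronize_rcu();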