On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@xxxxxxxxxx> wrote:
>
>
> Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
>
> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> >> """
> >> This guarantees that the page tables that are being walked
> >> aren't freed concurrently, but at the end of the walk, we
> >> have to grab a stable reference to the referenced page.
> >> For this we use the grab-reference-and-revalidate trick
> >> from above again:
> >> First we (locklessly) load the page
> >> table entry, then we grab a reference to the page that it
> >> points to (which can fail if the refcount is zero, in that
> >> case we bail), then we recheck that the page table entry
> >> is still the same, and if it changed in between, we drop the
> >> page reference and bail.
> >> This can, again, grab a reference to a page after it has
> >> already been freed and reallocated. The reason why this is
> >> fine is that the metadata structure that holds this refcount,
> >> `struct folio` (or `struct page`, depending on which kernel
> >> version you're looking at; in current kernels it's `folio`
> >> but `struct page` and `struct folio` are actually aliases for
> >> the same memory, basically, though that is supposed to maybe
> >> change at some point) is never freed; even when a page is
> >> freed and reallocated, the corresponding `struct folio`
> >> stays. This does have the fun consequence that whenever a
> >> page/folio has a non-zero refcount, the refcount can
> >> spuriously go up and then back down for a little bit.
> >> (Also it's technically not as simple as I just described it,
> >> because the `struct page` that the PTE points to might be
> >> a "tail page" of a `struct folio`.
> >> So actually we first read the PTE, the PTE gives us the
> >> `page*`, then from that we go to the `folio*`, then we
> >> try to grab a reference to the `folio`, then if that worked
> >> we check that the `page` still points to the same `folio`,
> >> and then we recheck that the PTE is still the same.)
> >> """
> >
> > Nngh. In trying to make this description fit all kernels (with
> > both pages and folios), you've complicated it maximally. Let's
> > try a more simple explanation:
> >
> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.

Oh, thanks. You're right, trying to talk about kernels with folios
made it unnecessarily complicated...
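To make that pattern concrete, here's a rough userspace-style sketch
of the grab-reference-and-revalidate trick using C11 atomics. All the
names below are made up for illustration; this is not the actual
gup.c code (which additionally has to deal with tail pages, pinning
vs. plain references, and so on):

/*
 * Userspace analogue of the "grab reference and revalidate" pattern
 * described above. All names here are invented for illustration.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_folio {
        atomic_int refcount;    /* 0 means "freed, do not touch" */
};

/* Analogue of a PTE slot that another thread may change concurrently. */
struct fake_pte {
        _Atomic(struct fake_folio *) folio;
};

/*
 * get_page_unless_zero()-style helper: only take a reference if the
 * refcount is not already zero.
 */
static bool folio_try_get_ref(struct fake_folio *folio)
{
        int old = atomic_load(&folio->refcount);

        do {
                if (old == 0)
                        return false;
        } while (!atomic_compare_exchange_weak(&folio->refcount, &old,
                                               old + 1));
        return true;
}

static void folio_put_ref(struct fake_folio *folio)
{
        atomic_fetch_sub(&folio->refcount, 1);
}

/*
 * Lockless lookup: load the entry, take a reference, then recheck that
 * the entry still points at the same folio. If it changed in between,
 * drop the (possibly spurious) reference and tell the caller to bail.
 */
static struct fake_folio *lockless_grab(struct fake_pte *pte)
{
        struct fake_folio *folio = atomic_load(&pte->folio);

        if (!folio || !folio_try_get_ref(folio))
                return NULL;    /* no mapping, or refcount was zero: bail */

        if (atomic_load(&pte->folio) != folio) {
                /* Entry changed under us; our reference may be on a
                 * folio that has been reused, so drop it and bail. */
                folio_put_ref(folio);
                return NULL;
        }
        return folio;           /* stable reference */
}

The point being that the increment in folio_try_get_ref() may hit a
folio that has already been reused for something else; that's harmless
only because the recheck afterwards drops the reference again, and
because the refcount metadata itself is never freed.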
> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Hm... that's this codepath?
arch_remove_memory -> __remove_pages -> __remove_section ->
sparse_remove_section -> section_deactivate ->
depopulate_section_memmap -> vmemmap_free -> remove_pagetable
which then walks down the page tables and ends up freeing individual
pages in remove_pte_table() using the confusingly-named
free_pagetable()?

I'm not sure what the synchronization against hotplug is - GUP-fast is
running with IRQs disabled, but other codepaths might not, like
get_ksm_page()? I don't know if that's holding something else for
protection...
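For reference, the IRQ-disabled window mentioned above is the one
wrapped around the lockless walk in mm/gup.c; very roughly, with
version-specific details and the retry/unpin handling left out, it is
shaped like this (a sketch, not the literal code):

/*
 * Heavily abbreviated sketch of the lockless GUP path, not the literal
 * mm/gup.c code (names and details differ between kernel versions).
 * The whole lockless page table walk runs with local interrupts
 * disabled; the normal page-table-freeing side synchronizes against
 * that window, either via the TLB flush IPI broadcast or the RCU-style
 * table freeing, before the tables are actually freed.
 */
static unsigned long lockless_pages_from_mm(unsigned long start,
                                            unsigned long end,
                                            unsigned int gup_flags,
                                            struct page **pages)
{
        unsigned long flags;
        int nr_pinned = 0;

        local_irq_save(flags);
        gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
        local_irq_restore(flags);

        return nr_pinned;
}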