On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> """
> This guarantees that the page tables that are being walked
> aren't freed concurrently, but at the end of the walk, we
> have to grab a stable reference to the referenced page.
> For this we use the grab-reference-and-revalidate trick
> from above again:
> First we (locklessly) load the page
> table entry, then we grab a reference to the page that it
> points to (which can fail if the refcount is zero, in that
> case we bail), then we recheck that the page table entry
> is still the same, and if it changed in between, we drop the
> page reference and bail.
> This can, again, grab a reference to a page after it has
> already been freed and reallocated. The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` (or `struct page`, depending on which kernel
> version you're looking at; in current kernels it's `folio`
> but `struct page` and `struct folio` are actually aliases for
> the same memory, basically, though that is supposed to maybe
> change at some point) is never freed; even when a page is
> freed and reallocated, the corresponding `struct folio`
> stays. This does have the fun consequence that whenever a
> page/folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's technically not as simple as I just described it,
> because the `struct page` that the PTE points to might be
> a "tail page" of a `struct folio`.
> So actually we first read the PTE, the PTE gives us the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` still points to the same `folio`,
> and then we recheck that the PTE is still the same.)
> """

Nngh.  In trying to make this description fit all kernels (with
both pages and folios), you've complicated it maximally.
Let's try a simpler explanation:

First we (locklessly) load the page table entry, then we grab a
reference to the folio that contains the page it points to (which can
fail if the refcount is zero; in that case we bail), then we recheck
that the page table entry is still the same, and if it changed in
between, we drop the folio reference and bail.

This can, again, grab a reference to a folio after it has already been
freed and reallocated. The reason why this is fine is that the
metadata structure that holds this refcount, `struct folio`, is never
freed; even when a folio is freed and reallocated, the corresponding
`struct folio` stays. This does have the fun consequence that whenever
a folio has a non-zero refcount, the refcount can spuriously go up and
then back down for a little bit.

(Also it's slightly more complex than I just described, because the
page that the PTE we just loaded points to might be in the middle of
being reallocated into a different folio. So actually we first read
the PTE, translate the PTE into the `page*`, then from that we go to
the `folio*`, then we try to grab a reference to the `folio`, then if
that worked we check that the `page` is still in the same `folio`, and
then we recheck that the PTE is still the same. Older kernels did not
make a clear distinction between pages and folios, so it was even more
confusing.)

Better?
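
If it helps to see the sequence written down, here is a minimal userspace sketch of the steps in the parenthetical, using C11 atomics. Everything in it is an assumption made for illustration: the `struct folio`, `struct page` and `pte_t` here are toy stand-ins, the PTE is modelled as nothing more than a pointer to a page, and the `model_*` names are hypothetical. It is not the kernel's code and it ignores everything else the real walk has to care about.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins, NOT the kernel's definitions. */
struct folio {
    atomic_int refcount;            /* 0 means the folio is currently free */
};

struct page {
    _Atomic(struct folio *) folio;  /* folio this page currently belongs to */
};

/* Model: a PTE is either 0 (not present) or the address of a struct page. */
typedef _Atomic(uintptr_t) pte_t;

/* Take a reference, but only if the refcount is not zero. */
static bool model_folio_try_get(struct folio *folio)
{
    int old = atomic_load(&folio->refcount);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&folio->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference taken by model_folio_try_get(). */
static void model_folio_put(struct folio *folio)
{
    int old = atomic_fetch_sub(&folio->refcount, 1);

    assert(old > 0);
}

/* Returns the folio with an extra reference held, or NULL if we bailed. */
static struct folio *model_try_grab_folio(pte_t *ptep)
{
    /* 1. Locklessly load the PTE. */
    uintptr_t pte = atomic_load(ptep);
    if (pte == 0)
        return NULL;

    /* 2. PTE -> page, 3. page -> folio. */
    struct page *page = (struct page *)pte;
    struct folio *folio = atomic_load(&page->folio);

    /* 4. Try to grab a reference; bail if the refcount is zero. */
    if (!model_folio_try_get(folio))
        return NULL;

    /*
     * 5. Is the page still in the same folio?
     * 6. Is the PTE still the same?
     * If either changed under us, undo the (spurious) reference and bail.
     */
    if (atomic_load(&page->folio) != folio || atomic_load(ptep) != pte) {
        model_folio_put(folio);
        return NULL;
    }

    return folio;
}
```

The bail-out path at the end is where the spurious up-and-down of the refcount comes from: the increment lands on whatever `struct folio` the page resolved to, and is dropped again once the recheck fails. That is only safe because the `struct folio` itself is never freed.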