On 8/30/22 11:53, David Hildenbrand wrote:
> Good, I managed to attract the attention of someone who understands
> that machinery :)
> 
> While validating whether GUP-fast and PageAnonExclusive code work
> correctly, I started looking at the whole RCU GUP-fast machinery. I do
> have a patch to improve PageAnonExclusive clearing (I think we're
> missing memory barriers to make it work as expected in any possible
> case), but I also eventually stumbled over a more generic issue that
> might need memory barriers.
> 
> Any thoughts on whether I am missing something, or whether this is
> actually missing memory barriers?
> 

It's actually missing memory barriers. In fact, others have had that
same thought! [1] :)

In that 2019 thread, I recall that this got dismissed because of a
focus on the IPI-based aspect of gup-fast synchronization (there was
some hand waving, perhaps accurate waving, about memory barriers vs.
CPU interrupts). But now that the RCU (non-IPI) implementation is more
widely used than it used to be, the issue is clearer.

> 
> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@xxxxxxxxxx>
> Date: Mon, 29 Aug 2022 16:57:07 +0200
> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
>  changed
> 
> mm/ksm.c:write_protect_page() has to make sure that no unknown
> references to a mapped page exist and that no additional ones with
> write permissions are possible -- unknown references could have write
> permissions and modify the page afterwards.
> 
> Conceptually, mm/ksm.c:write_protect_page() consists of:
>   (1) Clear/invalidate PTE
>   (2) Check if there are unknown references; back off if so.
>   (3) Update PTE (e.g., map it R/O)
> 
> Conceptually, GUP-fast code consists of:
>   (1) Read the PTE
>   (2) Increment refcount/pincount of the mapped page
>   (3) Check if the PTE changed by re-reading it; back off if so.
> 
> To make sure GUP-fast won't be able to grab additional references after
> clearing the PTE, but will properly detect the change and back off, we
> need a memory barrier between updating the refcount/pincount and
> checking if the PTE changed.
> 
> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
> explicit smp_mb__after_atomic() after the atomic RMW operation to
> increment the refcount and pincount.
> 
> ptep_clear_flush(), which is used to clear the PTE and flush the TLB,
> should imply a memory barrier for flushing the TLB, so don't add
> another one for now.
> 
> PageAnonExclusive handling requires further care and will be handled
> separately.
> 
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
>  mm/gup.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..0008b808f484 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>  		}
>  
> +		/*
> +		 * Update refcount/pincount before testing for changed PTE. This
> +		 * is required for code like mm/ksm.c:write_protect_page() that
> +		 * wants to make sure that a page has no unknown references
> +		 * after clearing the PTE.
> +		 */
> +		smp_mb__after_atomic();
> +
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			gup_put_folio(folio, 1, flags);
>  			goto pte_unmap;
> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	if (!folio)
>  		return 0;
>  
> +	/* See gup_pte_range(). */

Don't we usually also identify what each mb pairs with, in the
comments? That would help.
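Just to sketch what I mean (and I'm inferring the pairing site from the
commit description above, which says the other side's ordering comes
from the PTE clear + TLB flush via ptep_clear_flush() in
mm/ksm.c:write_protect_page()), something like:

	/*
	 * Pairs with the PTE clear + TLB flush in
	 * mm/ksm.c:write_protect_page(): the refcount/pincount update
	 * must be visible before the PTE is re-read below, so that
	 * either write_protect_page() observes our new reference, or
	 * we observe the cleared PTE and back off.
	 */
	smp_mb__after_atomic();

That way the ordering requirement is discoverable from either side.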
> +	smp_mb__after_atomic();
> +
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  		gup_put_folio(folio, refs, flags);
>  		return 0;
> @@ -2643,6 +2654,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  	if (!folio)
>  		return 0;
>  
> +	/* See gup_pte_range(). */
> +	smp_mb__after_atomic();
> +
>  	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
>  		gup_put_folio(folio, refs, flags);
>  		return 0;
> @@ -2683,6 +2697,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  	if (!folio)
>  		return 0;
>  
> +	/* See gup_pte_range(). */
> +	smp_mb__after_atomic();
> +
>  	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
>  		gup_put_folio(folio, refs, flags);
>  		return 0;

[1] https://lore.kernel.org/lkml/9465df76-0229-1b44-5646-5cced1bc1718@xxxxxxxxxx/

thanks,
-- 
John Hubbard
NVIDIA