On 02.09.22 00:27, Yang Shi wrote:
> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> sufficient to handle concurrent GUP-fast in all cases, it only handles
> traditional IPI-based GUP-fast correctly.  On architectures that send
> an IPI broadcast on TLB flush, it works as expected.  But on the
> architectures that do not use IPI to broadcast TLB flush, it may have
> the below race:
>
>    CPU A                                          CPU B
> THP collapse                                     fast GUP
>                                               gup_pmd_range() <-- see valid pmd
>                                                   gup_pte_range() <-- work on pte
> pmdp_collapse_flush() <-- clear pmd and flush
> __collapse_huge_page_isolate()
>     check page pinned <-- before GUP bump refcount
>                                                       pin the page
>                                                       check PTE <-- no change
> __collapse_huge_page_copy()
>     copy data to huge page
>     ptep_clear()
> install huge pmd for the huge page
>                                                       return the stale page
> discard the stale page
>
> The race could be fixed by checking whether PMD is changed or not after
> taking the page pin in fast GUP, just like what it does for PTE.  If the
> PMD is changed it means there may be parallel THP collapse, so GUP
> should back off.
>
> Also update the stale comment about serializing against fast GUP in
> khugepaged.
>
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: Yang Shi <shy828301@xxxxxxxxx>
> ---
>  mm/gup.c        | 30 ++++++++++++++++++++++++------
>  mm/khugepaged.c | 10 ++++++----
>  2 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index f3fc1f08d90c..4365b2811269 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>  }
>
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	struct dev_pagemap *pgmap = NULL;
>  	int nr_start = *nr, ret = 0;
> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>  		}
>
> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +		/*
> +		 * THP collapse conceptually does:
> +		 *   1. Clear and flush PMD
> +		 *   2. Check the base page refcount
> +		 *   3. Copy data to huge page
> +		 *   4. Clear PTE
> +		 *   5. Discard the base page
> +		 *
> +		 * So fast GUP may race with THP collapse then pin and
> +		 * return an old page since TLB flush is no longer sufficient
> +		 * to serialize against fast GUP.
> +		 *
> +		 * Check PMD, if it is changed just back off since it
> +		 * means there may be parallel THP collapse.
> +		 */
> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			gup_put_folio(folio, 1, flags);
>  			goto pte_unmap;
>  		}
> @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>   */
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	return 0;
>  }
> @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>  			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
>  					 PMD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> +		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pmdp++, addr = next, addr != end);
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2d74cf01f694..518b49095db3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
> -	 * After this gup_fast can't run anymore. This also removes
> -	 * any huge TLB entry from the CPU so we won't allow
> -	 * huge and small TLB entries for the same virtual address
> -	 * to avoid the risk of CPU bugs in that area.
> +	 * This removes any huge TLB entry from the CPU so we won't allow
> +	 * huge and small TLB entries for the same virtual address to
> +	 * avoid the risk of CPU bugs in that area.
> +	 *
> +	 * Parallel fast GUP is fine since fast GUP will back off when
> +	 * it detects PMD is changed.
>  	 */
>  	_pmd = pmdp_collapse_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);

As long as pmdp_collapse_flush() implies a full memory barrier (which I
think it does), this should work as expected.

Hopefully someone with experience on RCU GUP-fast (Jason, John? :) ) can
double-check.

To me this sounds sane.

Acked-by: David Hildenbrand <david@xxxxxxxxxx>

-- 
Thanks,

David / dhildenb
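
For anyone following the thread, the "pin, then recheck PMD and PTE" pattern
from the patch can be modelled in userspace. This is only a rough sketch and
not kernel code: toy_mm, toy_page, toy_mm_init(), toy_gup_fast_one() and
toy_collapse_clear_pmd() are hypothetical names invented for illustration,
the racer callback is an assumed stand-in for a concurrent khugepaged, and
C11 sequentially consistent atomics stand in for whatever ordering the real
pmdp_collapse_flush() provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct toy_page {
	atomic_int refcount;
};

struct toy_mm {
	_Atomic uint64_t pmd;	/* stand-in for *pmdp */
	_Atomic uint64_t pte;	/* stand-in for *ptep */
	struct toy_page page;	/* the base page the PTE maps */
};

static void toy_mm_init(struct toy_mm *mm, uint64_t pmd, uint64_t pte)
{
	atomic_init(&mm->pmd, pmd);
	atomic_init(&mm->pte, pte);
	atomic_init(&mm->page.refcount, 1);	/* one reference from the mapping */
}

/* Models the collapse side clearing the PMD (assumed analogue of
 * pmdp_collapse_flush()). */
static void toy_collapse_clear_pmd(struct toy_mm *mm)
{
	atomic_store(&mm->pmd, 0);
}

/*
 * Models one fast-GUP attempt. The optional racer runs after the PMD/PTE
 * snapshots and before the pin, i.e. inside the race window described in
 * the commit message. Returns true if the pin is kept, false on back-off.
 */
static bool toy_gup_fast_one(struct toy_mm *mm, void (*racer)(struct toy_mm *))
{
	uint64_t pmd = atomic_load(&mm->pmd);	/* "see valid pmd" */
	uint64_t pte = atomic_load(&mm->pte);	/* "work on pte" */

	if (racer != NULL)
		racer(mm);			/* concurrent collapse, injected */

	atomic_fetch_add(&mm->page.refcount, 1);	/* pin the page */

	/* Recheck both levels after the pin; drop the pin if either changed. */
	if (pmd != atomic_load(&mm->pmd) || pte != atomic_load(&mm->pte)) {
		atomic_fetch_sub(&mm->page.refcount, 1);
		return false;
	}
	return true;
}
```

Calling toy_gup_fast_one(&mm, NULL) keeps the pin, while passing
toy_collapse_clear_pmd as the racer makes the recheck fail and the pin is
dropped again. With only the PTE comparison, the second case would not be
detected in this model, because the racer never touches the PTE.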