On Mon, Sep 5, 2022 at 12:59 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 05.09.22 00:29, John Hubbard wrote:
> > On 9/1/22 15:27, Yang Shi wrote:
> >> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> >> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> >> sufficient to handle concurrent GUP-fast in all cases; it only handles
> >> traditional IPI-based GUP-fast correctly. On architectures that send
> >> an IPI broadcast on TLB flush, it works as expected. But on architectures
> >> that do not use an IPI to broadcast the TLB flush, the below race is
> >> possible:
> >>
> >>   CPU A                                CPU B
> >>   THP collapse                         fast GUP
> >>                                        gup_pmd_range() <-- see valid pmd
> >>                                        gup_pte_range() <-- work on pte
> >>   pmdp_collapse_flush() <-- clear pmd and flush
> >>   __collapse_huge_page_isolate()
> >>       check page pinned <-- before GUP bump refcount
> >>                                        pin the page
> >>                                        check PTE <-- no change
> >>   __collapse_huge_page_copy()
> >>       copy data to huge page
> >>   ptep_clear()
> >>   install huge pmd for the huge page
> >>                                        return the stale page
> >>   discard the stale page
> >
> > Hi Yang,
> >
> > Thanks for taking the trouble to write down these notes. I always
> > forget which race we are dealing with, and this is a great help. :)
> >
> > More...
> >
> >>
> >> The race can be fixed by checking whether the PMD has changed after
> >> taking the page pin in fast GUP, just like what is done for the PTE. If
> >> the PMD has changed, there may be a parallel THP collapse, so GUP
> >> should back off.
> >>
> >> Also update the stale comment about serializing against fast GUP in
> >> khugepaged.
> >>
> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> >> Signed-off-by: Yang Shi <shy828301@xxxxxxxxx>
> >> ---
> >>  mm/gup.c        | 30 ++++++++++++++++++++++++------
> >>  mm/khugepaged.c | 10 ++++++----
> >>  2 files changed, 30 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index f3fc1f08d90c..4365b2811269 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> >>  }
> >>
> >>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >> -			 unsigned int flags, struct page **pages, int *nr)
> >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> >> +			 unsigned long end, unsigned int flags,
> >> +			 struct page **pages, int *nr)
> >>  {
> >>  	struct dev_pagemap *pgmap = NULL;
> >>  	int nr_start = *nr, ret = 0;
> >> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >>  			goto pte_unmap;
> >>  		}
> >>
> >> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >> +		/*
> >> +		 * THP collapse conceptually does:
> >> +		 *   1. Clear and flush PMD
> >> +		 *   2. Check the base page refcount
> >> +		 *   3. Copy data to huge page
> >> +		 *   4. Clear PTE
> >> +		 *   5. Discard the base page
> >> +		 *
> >> +		 * So fast GUP may race with THP collapse then pin and
> >> +		 * return an old page since TLB flush is no longer sufficient
> >> +		 * to serialize against fast GUP.
> >> +		 *
> >> +		 * Check the PMD; if it has changed, just back off since
> >> +		 * there may be a parallel THP collapse.
> >> +		 */
> >
> > As I mentioned in the other thread, it would be a nice touch to move
> > such discussion into the comment header.
> >
> >> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> >> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
> >
> >
> > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> > whole lockless house of cards may fall apart if we try reading the
> > page table values without READ_ONCE().
>
> I came to the conclusion that the implicit memory barrier when grabbing
> a reference on the page is sufficient, such that we don't need READ_ONCE
> here.
>
> If we still intend to change that code, we should fix up all GUP-fast
> functions in a similar way. But again, I don't think we need a change here.
>
> >> -	 * After this gup_fast can't run anymore. This also removes
> >> -	 * any huge TLB entry from the CPU so we won't allow
> >> -	 * huge and small TLB entries for the same virtual address
> >> -	 * to avoid the risk of CPU bugs in that area.
> >> +	 * This removes any huge TLB entry from the CPU so we won't allow
> >> +	 * huge and small TLB entries for the same virtual address to
> >> +	 * avoid the risk of CPU bugs in that area.
> >> +	 *
> >> +	 * Parallel fast GUP is fine since fast GUP will back off when
> >> +	 * it detects the PMD is changed.
> >>  	 */
> >>  	_pmd = pmdp_collapse_flush(vma, address, pmd);
> >
> > To follow up on David Hildenbrand's note about this in the nearby thread...
> > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
> > all arches. It definitely does do an atomic op with a return value on x86,
> > but that's just one arch.
> >
>
> I think a ptep/pmdp clear + TLB flush really has to imply a memory
> barrier, otherwise TLB flushing code might easily mess up with
> surrounding code. But we'd better double-check.
>
> s390x executes an IDTE instruction, which performs serialization (->
> memory barrier). arm64 seems to use DSB instructions to enforce memory
> ordering.

IIUC we only need to care about the architectures that support both THP and fast GUP, right? If so, I got the below list:

Loongarch - I didn't see an explicit memory barrier, but it does IPI
MIPS - same as Loongarch
PowerPC - has a memory barrier
x86 - atomic operations imply a memory barrier
Arm - I didn't see an explicit memory barrier, and I'm not sure whether ARM's TLB flush instructions imply one, but it does IPI
Arm64 - uses DSB, as David mentioned
s390 - executes IDTE, as David mentioned

So I think the questionable architectures are Loongarch, MIPS and ARM. And the IPI is not sent if the PMD entry is not present on a remote CPU, so they may race against fast GUP, IIUC. But we'd better double-check with the maintainers.

>
> --
> Thanks,
>
> David / dhildenb
>
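To make the ordering argument in this thread concrete, below is a minimal userspace C analogue of the "snapshot, pin, re-check" scheme the patch applies to the PMD. This is only a sketch under assumptions stated in the comments, not kernel code: struct entry and its fields are invented for illustration, and a seq_cst atomic read-modify-write stands in for the full barrier that David says is implied by grabbing a page reference.

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for a page-table entry plus the refcount of
 * the page it maps. All names here are invented for illustration.
 */
struct entry {
        _Atomic unsigned long val;      /* plays the role of *pmdp / *ptep */
        _Atomic int refcount;           /* plays the role of the page refcount */
};

/*
 * GUP-fast side: snapshot the entry, pin, then re-read and compare.
 * The seq_cst atomic_fetch_add() is a full barrier, so the re-read in
 * step 3 cannot be hoisted before the pin in step 2. That ordering is
 * the property David relies on to argue that READ_ONCE() is not
 * needed for the re-check.
 */
static bool try_pin(struct entry *e)
{
        /* 1. Snapshot, like gup_pte_range() reading the pmd and pte. */
        unsigned long snap = atomic_load(&e->val);

        if (!snap)                      /* entry already cleared */
                return false;

        /* 2. Pin: an atomic RMW, like the page-reference grab. */
        atomic_fetch_add(&e->refcount, 1);

        /*
         * 3. Re-check: if the entry changed underneath us (e.g. a THP
         * collapse cleared the PMD), drop the pin and back off.
         */
        if (atomic_load(&e->val) != snap) {
                atomic_fetch_sub(&e->refcount, 1);      /* like put_page() */
                return false;
        }
        return true;
}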
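The collapse side of the same sketch, reusing the hypothetical struct entry above, shows what the arch-by-arch question is really about: whether the clear plus flush in pmdp_collapse_flush() acts as a full barrier, so that the refcount check cannot be reordered before the PMD clear. A seq_cst atomic_exchange() models that barrier here; an architecture whose flush lacked it would correspond to a relaxed exchange, which could let both sides succeed and reintroduce the race.

/*
 * Collapse side: clear the entry, then check for pins. With seq_cst
 * ordering, a concurrent try_pin() either sees val == 0 and backs
 * off, or its refcount bump is visible below and we back off instead;
 * the two sides can never both succeed.
 */
static bool try_collapse(struct entry *e)
{
        /*
         * 1. Clear the entry. The seq_cst exchange models the PMD
         * clear + TLB flush in pmdp_collapse_flush() acting as a
         * full memory barrier.
         */
        unsigned long old = atomic_exchange(&e->val, 0);

        /* 2. Check the refcount, like __collapse_huge_page_isolate(). */
        if (atomic_load(&e->refcount) != 0) {
                /* Someone holds a pin: restore the entry and give up. */
                atomic_store(&e->val, old);
                return false;
        }

        /* No pins: safe to copy the data and discard the base page. */
        return true;
}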