On Mon 30-09-19 10:57:08, John Hubbard wrote:
> On 9/30/19 2:20 AM, Jan Kara wrote:
> > On Fri 27-09-19 12:31:41, John Hubbard wrote:
> > > On 9/27/19 5:33 AM, Michal Hocko wrote:
> > > > On Thu 26-09-19 20:26:46, John Hubbard wrote:
> > > > > On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> ...
> > >      CPU 0                               CPU 1
> > >   --------                            ---------------
> > >   get_user_pages_fast()
> > >
> > >                                       do_unmap()
> > >                                         unmap_region()
> > >                                           free_pgtables()
> > >   /*
> > >    * speculative reads, re-ordered
> > >    * by the CPU at run time, not
> > >    * compile time. The function calls
> > >    * are not really in this order, but
> > >    * the corresponding reads could be.
> > >    */
> > >   gup_pgd_range()
> > >     gup_p4d_range()
> > >       gup_pud_range()
> > >         gup_pmd_range()
> > >           pmd = READ_ONCE(*pmdp)
> > >           /* This is stale */
> > >
> > >                                       tlb_finish_mmu()
> > >                                         tlb_flush_mmu()
> > >                                           tlb_flush_mmu_tlbonly()
> > >                                             tlb_flush()
> > >                                               flush_tlb_mm
> > >                                                 flush_tlb_mm_range
> > >                                                   flush_tlb_others
> > >                                                     native_flush_tlb_others
> > >                                                       smp_call_function_many: IPIs
> > >                                                       ...blocks until CPU1 reenables
> > >                                                          interrupts
> > >
> > >   local_irq_disable()
> > >   ...continue the page walk based
> > >   on stale/freed data...
> >
> > Yes, but then we have:
> >
> >	head = try_get_compound_head(page, 1);
> >
> > which has an atomic operation providing barrier in it and then we have:
> >
> >	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >		put_page(head);
> >		goto pte_unmap;
> >	}
> >
> > So we reload PTE again and check the value didn't change. Which should
> > prevent the race you explain above. Or do I miss anything?
>
> Well, a couple of questions:
>
> 1. Is there *really* a memory barrier in try_get_compound_head()? Because
> I only see a READ_ONCE compile barrier, which would mean that run time
> reordering is still possible.

try_get_compound_head() has page_cache_add_speculative() which is
atomic_add_unless(), which is guaranteed to provide ordering.

> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
> it's already found the pte based on reading a stale pmd. So checking the
> pte seems like it's checking the wrong thing--it's too late, for this case,
> right?

Well, if the PMD is getting freed, all PTEs in it should be cleared by that
time, shouldn't they? So although we read from a stale PMD, either we
already see a cleared PTE or the pte_val(pte) != pte_val(*ptep) check will
fail, and so we never actually succeed in getting a stale PTE entry (at
least unless the page table page that used to be a PMD can get freed and
reused - which is not the case in the example you've shown above). So I
still don't see a problem. That being said, I don't claim to be an expert
in this particular area. I've just always thought GUP prevents races like
this by the scheme I describe, so I'd like to see what I'm missing :).

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
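
For anyone trying to follow the ordering argument above, here is a minimal
user-space sketch of the pattern being discussed: a speculative read of the
PTE, a full-barrier atomic refcount bump, then a recheck of the PTE. The
names (fake_pte, fake_page, gup_one_pte) are made up for illustration; this
is not the kernel's gup_pte_range() or page_cache_add_speculative(), just a
model of the same idea using C11 atomics.

/*
 * User-space model of the GUP-fast recheck pattern. NOT kernel code;
 * all names here are hypothetical. The point: the refcount bump is a
 * full (seq_cst) read-modify-write, so the second load of the PTE is
 * ordered after the refcount became visibly elevated, and a concurrent
 * unmap that cleared the PTE will be noticed.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct fake_page {
	atomic_int refcount;
};

static atomic_uintptr_t fake_pte;	/* stands in for *ptep */

/* Rough analogue of page_cache_add_speculative(): fail if count hit 0. */
static bool try_get_page_speculative(struct fake_page *page)
{
	int old = atomic_load(&page->refcount);

	do {
		if (old == 0)
			return false;	/* page is already being freed */
	} while (!atomic_compare_exchange_weak(&page->refcount, &old, old + 1));
	/* The seq_cst RMW above orders the PTE re-read below after it. */
	return true;
}

/* Rough analogue of the fast-path body of gup_pte_range(). */
static bool gup_one_pte(struct fake_page *page)
{
	uintptr_t pte = atomic_load(&fake_pte);	/* speculative read */

	if (pte == 0)
		return false;

	if (!try_get_page_speculative(page))
		return false;

	/* Recheck: if the PTE changed (e.g. cleared by unmap), back off. */
	if (atomic_load(&fake_pte) != pte) {
		atomic_fetch_sub(&page->refcount, 1);	/* put_page() */
		return false;
	}
	return true;	/* hold a reference, PTE was stable across the bump */
}

int main(void)
{
	struct fake_page page = { .refcount = 1 };

	atomic_store(&fake_pte, (uintptr_t)&page);
	printf("gup_one_pte: %s\n",
	       gup_one_pte(&page) ? "got page" : "backed off");
	return 0;
}

The recheck only helps, as noted above, as long as the page table page the
stale PMD pointed at cannot be freed and reused for something else while the
walk is in progress - which is the part the interrupt-disabled / IPI scheme
is supposed to guarantee.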