On 10/1/19 12:10 AM, Jan Kara wrote: > On Mon 30-09-19 10:57:08, John Hubbard wrote: >> On 9/30/19 2:20 AM, Jan Kara wrote: >>> On Fri 27-09-19 12:31:41, John Hubbard wrote: >>>> On 9/27/19 5:33 AM, Michal Hocko wrote: >>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote: >>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote: >> ... >> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then, >> it's already found the pte based on reading a stale pmd. So checking the >> pte seems like it's checking the wrong thing--it's too late, for this case, >> right? > > Well, if PMD is getting freed, all PTEs in it should be cleared by that > time, shouldn't they? So although we read from stale PMD, either we already > see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and > so we never actually succeed in getting stale PTE entry (at least unless > the page table page that used to be PMD can get freed and reused > - which is not the case in the example you've shown above). > Right, that's not what the example shows, but there is nothing here to prevent the page table pages from being freed and re-used. > So I still don't see a problem. That being said I don't feel being expert > in this particular area. I just always thought GUP prevents races like this > by the scheme I describe so I'd like to see what I'm missing :). > I'm very much still "in training" here, so I hope I'm not wasting everyone's time. But I feel confident in stating at least this much: There are two distinct lockless synchronization mechanisms here, each protecting against a different issue, and it's important not to conflate them and think that one protects against the other. I still see a hole in (2) below. The mechanisms are: 1) Protection against a page (not the page table itself) getting freed while get_user_pages*() is trying to take a reference to it. This is avoided by the try-get and the re-checking of pte values that you mention above. It's an elegant little thing, too. :) 2) Protection against page tables getting freed while a get_user_pages_fast() call is in progress. This relies on disabling interrupts in gup_fast(), while firing interrupts in the freeing path (via tlb flushing, IPIs). And on memory barriers (doh--missing!) to avoid leaking memory references outside of the irq disabling. This one has a problem, because Documentation/memory-barriers.txt points out: INTERRUPT DISABLING FUNCTIONS ----------------------------- Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts (RELEASE equivalent) will act as compiler barriers only. So if memory or I/O barriers are required in such a situation, they must be provided from some other means. ...and so I'm suggesting that we need something approximately like this: diff --git a/mm/gup.c b/mm/gup.c index 23a9f9c9d377..1678d50a2d8b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { local_irq_disable(); + smp_mb(); gup_pgd_range(addr, end, gup_flags, pages, &nr); + smp_mb(); local_irq_enable(); ret = nr; thanks, -- John Hubbard NVIDIA