On Tue 01-10-19 11:43:30, John Hubbard wrote:
> On 10/1/19 12:10 AM, Jan Kara wrote:
> > On Mon 30-09-19 10:57:08, John Hubbard wrote:
> >> On 9/30/19 2:20 AM, Jan Kara wrote:
> >>> On Fri 27-09-19 12:31:41, John Hubbard wrote:
> >>>> On 9/27/19 5:33 AM, Michal Hocko wrote:
> >>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
> >>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> >> ...
> >> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
> >> it's already found the pte based on reading a stale pmd. So checking the
> >> pte seems like it's checking the wrong thing--it's too late, for this case,
> >> right?
> >
> > Well, if PMD is getting freed, all PTEs in it should be cleared by that
> > time, shouldn't they? So although we read from stale PMD, either we already
> > see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and
> > so we never actually succeed in getting stale PTE entry (at least unless
> > the page table page that used to be PMD can get freed and reused
> > - which is not the case in the example you've shown above).
>
> Right, that's not what the example shows, but there is nothing here to prevent
> the page table pages from being freed and re-used.
>
> > So I still don't see a problem. That being said I don't feel being expert
> > in this particular area. I just always thought GUP prevents races like this
> > by the scheme I describe so I'd like to see what I'm missing :).
>
> I'm very much still "in training" here, so I hope I'm not wasting everyone's
> time. But I feel confident in stating at least this much:
>
> There are two distinct lockless synchronization mechanisms here, each
> protecting against a different issue, and it's important not to conflate
> them and think that one protects against the other. I still see a hole in
> (2) below. The mechanisms are:
>
> 1) Protection against a page (not the page table itself) getting freed
> while get_user_pages*() is trying to take a reference to it.
> This is avoided by the try-get and the re-checking of pte values that you
> mention above. It's an elegant little thing, too. :)
>
> 2) Protection against page tables getting freed while a get_user_pages_fast()
> call is in progress. This relies on disabling interrupts in gup_fast(), while
> firing interrupts in the freeing path (via tlb flushing, IPIs). And on memory
> barriers (doh--missing!) to avoid leaking memory references outside of the
> irq disabling.

OK, so you are concerned that the page table walk of pgd->p4d->pud->pmd
will get prefetched by the CPU before interrupts are disabled, that somewhere
in the middle of the walk an IPI flushing TLBs on that CPU will happen,
allowing munmap() on another CPU to proceed and free page tables, and that
the rest of the walk will then happen on freed and possibly reused pages. Do
I understand you right?

Realistically, I don't think this can happen as I'd expect the CPU to throw
away the speculation state on interrupt. But that's just my expectation and
I agree that I don't see anything in Documentation/memory-barriers.txt that
would prevent what you are concerned about. Let's ask Paul :)

Paul, we are discussing here a possible race between
mm/gup.c:__get_user_pages_fast() and mm/mmap.c:unmap_region(). The first
has code like:

local_irq_save(flags);
load pgd from current->mm
load p4d from pgd
load pud from p4d
load pmd from pud
...
local_irq_restore(flags);

while the second has code like:

unmap_region()
  walk pgd
  walk p4d
  walk pud
  walk pmd
    clear ptes
  flush tlb
  free page tables

Now the serialization between these two relies on the fact that flushing
TLBs from unmap_region() requires an IPI to be served on each CPU, so in a
naive understanding unmap_region() shouldn't be able to get to the 'free
unused page tables' part until __get_user_pages_fast() enables interrupts
again. Now as John points out we don't see anything in
Documentation/memory-barriers.txt that would actually guarantee this.
So do we really need something like smp_rmb() after disabling interrupts in
__get_user_pages_fast() or is the race John is concerned about impossible?
Thanks!

> This one has a problem, because Documentation/memory-barriers.txt points out:
>
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
>
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only. So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
>
>
> ...and so I'm suggesting that we need something approximately like this:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 23a9f9c9d377..1678d50a2d8b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>  	    gup_fast_permitted(start, end)) {
>  		local_irq_disable();
> +		smp_mb();
>  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
> +		smp_mb();
>  		local_irq_enable();
>  		ret = nr;

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
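
The ordering question in this thread can be sketched in userspace with C11
atomics. This is only a rough model under stated assumptions: the names
walker_irqs_off, table_slot, walk_begin(), walk_end(), unmap_clear() and
unmap_may_free() are hypothetical and do not exist in the kernel, and
atomic_thread_fence(memory_order_seq_cst) stands in for the proposed
smp_mb(). It treats "interrupts disabled" as an ordinary shared flag, which
real hardware does not do, so it illustrates the store-then-load pattern on
both sides without settling whether the kernel itself needs the barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; nothing here is a kernel symbol. */
static atomic_bool walker_irqs_off;   /* models "IRQs disabled on the walking CPU" */
static _Atomic(int *) table_slot;     /* models an upper-level entry pointing at a page table page */

/* Walker: models local_irq_save() followed by the first load of the walk. */
static int *walk_begin(void)
{
    atomic_store_explicit(&walker_irqs_off, true, memory_order_relaxed);
    /* Stand-in for the smp_mb() proposed after disabling interrupts. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&table_slot, memory_order_relaxed);
}

/* Walker: models the end of the walk followed by local_irq_restore(). */
static void walk_end(void)
{
    /* Sanity check: a walk must have been started. */
    assert(atomic_load_explicit(&walker_irqs_off, memory_order_relaxed));
    /* Stand-in for the smp_mb() proposed before enabling interrupts. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&walker_irqs_off, false, memory_order_relaxed);
}

/* Freeing side: models "clear the entry"; returns the table that was mapped. */
static int *unmap_clear(void)
{
    return atomic_exchange_explicit(&table_slot, NULL, memory_order_relaxed);
}

/*
 * Freeing side: models "the IPI was served". In this model the table may be
 * freed only if the walker is not inside its IRQs-off section.
 */
static bool unmap_may_free(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    return !atomic_load_explicit(&walker_irqs_off, memory_order_relaxed);
}
```

In this model each side stores to one location, issues a full fence, and then
loads the other location. Under the C11 memory model the two seq_cst fences
rule out the outcome where walk_begin() returns the old table while a later
unmap_may_free() still reports that freeing is safe before walk_end() has
run. Without the fences that outcome is permitted, which is the shape of the
hole John describes.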