On Thu, Aug 31, 2017 at 02:39:17PM -0400, Felix Kuehling wrote:
> On 2017-08-31 09:59 AM, Jerome Glisse wrote:
> > [Adding Intel folks as they might be interested in this discussion]
> >
> > On Wed, Aug 30, 2017 at 05:51:52PM -0400, Felix Kuehling wrote:
> >> Hi Jérôme,
> >>
> >> I have some questions about the potential range-start-end race you
> >> mentioned.
> >>
> >> On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
> >>> Note that a lot of existing users are broken with respect to
> >>> range_start/range_end. Many users only have a range_start() callback,
> >>> but there is nothing preventing them from undoing what was invalidated
> >>> in their range_start() callback after it returns but before any CPU
> >>> page table update takes place.
> >>>
> >>> The code pattern used in kvm or umem odp is an example of how to
> >>> properly avoid such a race. In a nutshell, use some kind of sequence
> >>> number and an active range invalidation counter to block anything that
> >>> might undo what the range_start() callback did.
> >> What happens when we start monitoring an address range after
> >> invalidate_range_start was called? Sounds like we have to keep track of
> >> all active invalidations for just such a case, even in address ranges
> >> that we don't currently care about.
> >>
> >> What are the things we cannot do between invalidate_range_start and
> >> invalidate_range_end? amdgpu calls get_user_pages to re-validate our
> >> userptr mappings after the invalidate_range_start notifier invalidated
> >> them. Do we have to wait for invalidate_range_end before we can call
> >> get_user_pages safely?
> > Well the whole userptr bo object is somewhat broken from the start.
> > You never defined the semantics of it, ie what is expected.
> > I can think of 2 different semantics:
> >   A) a uptr buffer object is a snapshot of memory at the time of
> >      uptr buffer object creation
> >   B) a uptr buffer object allows the GPU to access a range of virtual
> >      addresses of a process and shares a coherent view of that range
> >      between CPU and GPU
> >
> > As it was implemented it is more in line with B, but it is not defined
> > anywhere AFAICT.
>
> Yes, we're trying to achieve B, that's why we have an MMU notifier in
> the first place. But it's been a struggle getting it to work properly,
> and we're still dealing with some locking issues and now this one.
>
> > Anyway, getting back to your questions, it kind of doesn't matter, as
> > you are using GUP, ie you are pinning pages, except for one scenario
> > (at least I can only think of one).
> >
> > The problematic case is a race between a CPU write to a zero page or
> > COW page and the GPU driver doing a read-only GUP:
> [...]
>
> Thanks, I was aware of COW but not of the zero-page case. I believe in
> most practical cases our userptr mappings are read-write, so this is
> probably not causing us any real trouble at the moment.
>
> > So I would first define the semantics of the uptr bo and then fix the
> > code accordingly. Semantic A is easier to implement and you could just
> > drop the whole mmu_notifier. Maybe it is better to create a uptr
> > buffer object every time you want to snapshot a range of addresses. I
> > don't think the overhead of buffer creation would matter.
>
> That doesn't work for KFD and our compute memory model where CPU and GPU
> expect to share the same address space.
>
> > If you want to close the race for COW and zero page in the case of
> > read-only GUP, there is no other way than what KVM or ODP is doing. I
> > had a patchset to simplify all this but I need to bring it back to
> > life.
>
> OK. I'll look at these to get an idea.
>
> > Note that other things might race, but as you pin the pages they do
> > not matter.
> > It just means that if you GUP after range_start() but before
> > range_end() and before the CPU page table update, then you pinned the
> > same old page again and nothing will happen (migrate will fail,
> > MADV_FREE will be a nop, ...). So you just did the range_start()
> > callback for nothing in those cases.
>
> We pin the memory because the GPU wants to access it. So this is
> probably OK if the migration fails. However, your statement that we
> "just did the range_start() callback for nothing" implies that we could
> as well have ignored the range_start callback. But I don't think that's
> true. That way we'd keep a page pinned that is no longer mapped in the
> CPU address space. So the GPU mapping would be out of sync with the CPU
> mapping.

Here is an example of "for nothing":

    CPU0                           CPU1
                                   migrate page at addr A
                                   invalidate_start addr A
    unbind_ttm(for addr A)
    use ttm object for addr A
      GUP addr A
                                   page table update
                                   invalidate_end addr A
                                   refcount check fails because of GUP
                                   restore page table to A

This is what I meant by range_start() for nothing: on CPU0 you
invalidated the ttm object for nothing, but this is because of a race.
Fixing the COW race would also fix this race and avoid the migration
failing because of the GPU GUP.

I was not saying you should not use mmu_notifier. For achieving B you
need mmu_notifier. Note that if you do like ODP/KVM then you do not need
to pin pages.

Cheers,
Jérôme