Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

Daniel Vetter <daniel@xxxxxxxx> · Mon, 26 Mar 2018 10:01:21 +0200

On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
> Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
> > On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
> > > Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
> > > > On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> > > > > Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > > > > > [SNIP]
> > > > > > For the in-driver reservation path (CS) having a slow-path that grabs a
> > > > > > temporary reference, drops the vram lock and then locks the reservation
> > > > > > normally (using the acquire context used already for the entire CS) is a
> > > > > > bit tricky, but totally feasible. Ttm doesn't do that though.
> > > > > That is exactly what we do in amdgpu as well, it's just not very efficient
> > > > > nor reliable to retry getting the right pages for a submission over and over
> > > > > again.
> > > > Out of curiosity, where's that code? I did read the ttm eviction code way
> > > > back, and that one definitely didn't do that. Would be interesting to
> > > > update my understanding.
> > > That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
> > > grabbing, releasing and regrabbing locks in a loop.
> > > 
> > > Then in amdgpu_cs_submit() we grab an lock preventing page table updates and
> > > check if all pages are still the one we want to have:
> > > >          amdgpu_mn_lock(p->mn);
> > > >          if (p->bo_list) {
> > > >                  for (i = p->bo_list->first_userptr;
> > > >                       i < p->bo_list->num_entries; ++i) {
> > > >                          struct amdgpu_bo *bo = p->bo_list->array[i].robj;
> > > > 
> > > >                          if
> > > > (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
> > > >                                  amdgpu_mn_unlock(p->mn);
> > > >                                  return -ERESTARTSYS;
> > > >                          }
> > > >                  }
> > > >          }
> > > If anything changed on the page tables we restart the whole IOCTL using
> > > -ERESTARTSYS and try again.
> > I'm not talking about userptr here, but general bo eviction. Sorry for the
> > confusion.
> > 
> > The reason I'm dragging all the general bo management into this
> > discussions is because we do seem to have fairly fundamental difference in
> > how that's done, with resulting consequences for the locking hierarchy.
> > 
> > And if this invalidate_mapping stuff should work, together with userptr
> > and everything else, I think we're required to agree on how this is all
> > supposed to nest, and how exactly we should back off for the other side
> > that needs to break the locking circle.
> > 
> > That aside, I don't entirely understand why you need to restart so much. I
> > figured that get_user_pages is ordered correctly against mmu
> > invalidations, but I get the impression you think that's not the case. How
> > does that happen?
> 
> Correct. I've had the same assumption, but both Jerome as well as our
> internal tests proved me wrong on that.
> 
> Key take away from that was that you can't take any locks from neither the
> MMU notifier nor the shrinker you also take while calling kmalloc (or
> simpler speaking get_user_pages()).
> 
> Additional to that in the MMU or shrinker callback all different kinds of
> locks might be held, so you basically can't assume that you do thinks like
> recursive page table walks or call dma_unmap_anything.

That sounds like a design bug in mmu_notifiers, since it would render them
useless for KVM. And they were developed for that originally. I think I'll
chat with Jerome to understand this, since it's all rather confusing.

> Thinking a moment about that it actually seems to make perfect sense.
> So it doesn't matter what order you got between the mmap_sem and your buffer
> or allocation lock, it will simply be incorrect with other locks in the
> system anyway.

Hm, doesn't make sense to me, at least from a locking  inversion pov. I
thought the only locks that are definitely part of the mmu_nofitier
callback contexts are the mm locks. We're definitely fine with those, and
kmalloc didn't bite us yet. Direct reclaim is really limited in what it
can accomplish for good reasons.

So the only thing I'm worried here is that we have a race between the
invalidate and the gup calls, and I think fixing that race will make the
locking more fun. But again, if that means you can'd dma_unmap stuff then
that feels like mmu notifiers are broken by design.
-Daniel

> The only solution to that problem we have found is the dance we have in
> amdgpu now:
> 
> 1. Inside invalidate_range_start you grab a recursive read side lock.
> 2. Wait for all GPU operation on that pages to finish.
> 3. Then mark all pages as dirty and accessed.
> 4. Inside invalidate_range_end you release your recursive read side lock
> again.
> 
> Then during command submission you do the following:
> 
> 1. Take the locks Figure out all the userptr BOs which needs new pages.
> 2. Drop the locks again call get_user_pages().
> 3. Install the new page arrays and reiterate with #1
> 4. As soon as everybody has pages update your page tables and prepare all
> command submission.
> 5. Take the write side of our recursive lock.
> 6. Check if all pages are still valid, if not restart the whole IOCTL.
> 7. Submit the job to the hardware.
> 8. Drop the write side of our recursive lock.
> 
> Regards,
> Christian.
> 
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch