On Mon, Jun 6, 2022 at 12:15 PM Christian König
<christian.koenig@xxxxxxx> wrote:
>
>
>
> On 03.06.22 at 21:11, Bas Nieuwenhuizen wrote:
> > On Fri, Jun 3, 2022 at 8:41 PM Christian König
> > <christian.koenig@xxxxxxx> wrote:
> >> On 03.06.22 at 19:50, Bas Nieuwenhuizen wrote:
> >>> [SNIP]
> >>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>>>> For this series, not really. To clarify, there are two sides to
> >>>>> getting GPU bubbles and no overlap:
> >>>>>
> >>>>> (1) VM operations implicitly wait for earlier CS submissions
> >>>>> (2) CS submissions implicitly wait for earlier VM operations
> >>>>>
> >>>>> Together, these combine to ensure that you get a (potentially small)
> >>>>> bubble any time VM work happens.
> >>>>>
> >>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>>>> to do. However, while writing the userspace for this I noticed this
> >>>>> isn't enough to get rid of all our GPU bubbles. In particular, when
> >>>>> doing a non-sparse map of a new BO, that map tends to need to be
> >>>>> waited on for the next CS anyway for API semantics. Because VM
> >>>>> operations happen on a single timeline, this high-priority map can
> >>>>> end up being blocked by earlier sparse maps, and hence the bubble in
> >>>>> that case still exists.
> >>>>>
> >>>>> So in this series I try to tackle (1) instead. Since GPU work
> >>>>> typically lags behind CPU submissions and VM operations aren't that
> >>>>> slow, we can typically execute VM operations early enough that any
> >>>>> implicit syncs from (2) are less of an issue, or none at all.
> >>>> Ok, once more, since you don't seem to understand what I want to say:
> >>>> it isn't possible to fix #1 before you have fixed #2.
> >>>>
> >>>> The VM unmap operation here is a barrier which divides the CS
> >>>> operations into a before and an after. This is intentional design.
> >>> Why is that barrier needed? The two barriers I got and understood, and
> >>> that I think we can deal with, are:
> >>>
> >>> 1) The VM unmap is a barrier between a prior CS and a later memory free.
> >>> 2) The TLB flush needs to happen between a VM unmap and a later CS.
> >>>
> >>> But why do we need the VM unmap to be a strict barrier between a prior
> >>> CS and a later CS?
> >> Exactly because of the two reasons you mentioned.
> > This is the part I'm not seeing. I get that removing #2 is a
> > nightmare, which is why I did something that doesn't violate that
> > constraint.
> >
> > Like, if an explicit CS that was running before the VM operation runs
> > until after the VM operation (and hence possibly until after the TLB
> > flush, or otherwise has the TLB flush not apply due to lack of async
> > TLB flush support), that is not an issue. It might see the state from
> > before the unmap, or after the unmap, or some intermediate state, and
> > all of those would be okay.
> >
> > We still get the constraint that the TLB flush happens between the VM
> > unmap and a later CS, and hence the unmap is certainly visible to them.
>
> So you basically just want to set the sync mode in
> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

Yes, with the caveat that I want to do that only for
DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit
sync we get the old implicit behavior, and if we submit a CS with
explicit sync we get the new explicit behavior). The rest of the series
is basically just about enabling explicit sync submissions.
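
To make that concrete, here is a rough sketch of the fence selection
this boils down to for an unmap. It is purely illustrative and not the
actual patch: the function name is made up, and a plain blocking wait
stands in for adding the fences as dependencies of the VM update job.

#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only, not the actual patch: have a VM unmap wait
 * only for implicitly synced work.  Fences with KERNEL/WRITE/READ usage
 * are still waited for (the old behavior), while DMA_RESV_USAGE_BOOKKEEP
 * fences from explicitly synced CS submissions are skipped, so the unmap
 * no longer acts as a barrier for them.
 */
static long vm_unmap_sync_implicit_only(struct dma_resv *resv)
{
        /* DMA_RESV_USAGE_READ also includes KERNEL and WRITE fences,
         * but not BOOKKEEP ones, which is exactly the split described
         * above. */
        return dma_resv_wait_timeout(resv, DMA_RESV_USAGE_READ, true,
                                     MAX_SCHEDULE_TIMEOUT);
}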

> That should be doable, but then you don't need any of the other changes.
>
> Regards,
> Christian.
>
> >
> >> #1 is rather easy to fix: you just need to copy all dma_fences from the
> >> page table dma_resv object over to the BO's dma_resv object in the gem
> >> close handler, e.g. exactly what you suggested with the dma_resv_copy
> >> function.
> >>
> >> #2 is a nightmare.
> >>
> >> We can't move the TLB flush to the end of the unmap operation because
> >> async TLB flushes are either a bit complicated (double flushes, etc.) or
> >> don't even work at all because of hw bugs. So to have a reliable TLB
> >> flush we must make sure that nothing else is ongoing, and that means a
> >> CS->VM->CS barrier.
> >>
> >> We already try very hard to circumvent that on maps by (for example)
> >> using a completely new VMID for CS after the VM map operation.
> >>
> >> But for the unmap operation we would need some kind of special dma_fence
> >> implementation which would not only wait for all existing dma_fences but
> >> also for the ones added until the unmap operation is completed, because
> >> otherwise the operation we do at #1 would simply not catch all
> >> dma_fences which have access to the memory.
> >>
> >> That's certainly doable, but I think just using the drm_exec stuff I
> >> already came up with is easier.
> >>
> >> When we can grab locks for all the BOs involved, amdgpu_vm_clear_freed()
> >> goes away and we can keep track of the unmap operations in the bo_va
> >> structure.
> >>
> >> With that done you can do the explicit sync you noted in the bo_va
> >> structure, and implicit sync when the bo_va structure goes away.
> >>
> >> Then the only reason I can see why we would need a CS->VM dependency is
> >> implicit synchronization, and that's what we are trying to avoid here in
> >> the first place.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>> To get rid of this barrier you must first fix the part where CS
> >>>> submissions wait for the VM operation to complete, i.e. the necessity
> >>>> of the barrier.
> >>>>
> >>>> I've been working on this for a couple of years now and I'm really
> >>>> running out of ideas for how to explain this restriction.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
>
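
For reference, the #1 fix quoted above, copying the page-table fences
over to the BO at gem close time, could look roughly like the
following. This is only a sketch under my own assumptions, not the
actual amdgpu handler: mirror_vm_fences_on_close() and the vm_resv
parameter are made-up names standing in for the real gem close path
and the per-VM page-table reservation object, assuming
dma_resv_copy_fences() is the dma_resv_copy helper meant here.

#include <linux/dma-resv.h>
#include <drm/drm_gem.h>

/*
 * Sketch only, not the actual amdgpu gem close handler: mirror the
 * fences of the per-VM page-table reservation object into the BO's own
 * reservation object, so that freeing the memory later still waits for
 * the pending unmap.
 */
static int mirror_vm_fences_on_close(struct drm_gem_object *obj,
                                     struct dma_resv *vm_resv)
{
        int r;

        r = dma_resv_lock(obj->resv, NULL);
        if (r)
                return r;

        /* Copy all fences, including DMA_RESV_USAGE_BOOKKEEP ones. */
        r = dma_resv_copy_fences(obj->resv, vm_resv);

        dma_resv_unlock(obj->resv);
        return r;
}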