On Fri, Jun 3, 2022 at 7:42 PM Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@xxxxxxx> wrote:
> >> Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
> >>> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@xxxxxxx> wrote:
> >>>> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> >>>>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> >>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> >>>>>>> [SNIP]
> >>>>>>>>> I do have to fix some stuff indeed, especially for the GEM close, but
> >>>>>>>>> with that we should be able to keep the same basic approach?
> >>>>>>>> Nope, not even remotely.
> >>>>>>>>
> >>>>>>>> What we need is the following:
> >>>>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> >>>>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
> >>>>>>>> also all buffers we potentially need to unmap.
> >>>>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> >>>>>>>> areas directly when they are unmapped.
> >>>>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
> >>>>>>>> combination.
> >>>>>>>> 5. When the bo_va structure is destroyed because of closing the handle,
> >>>>>>>> move the last clear operation over to the VM as implicit sync.
> >>>>>>>>
> >>>>>>> Hi Christian, isn't that a different problem though (one that we're also
> >>>>>>> trying to solve, but in your series)?
> >>>>>>>
> >>>>>>> What this patch tries to achieve is to allow
> >>>>>>>
> >>>>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>>> (t+1) a VM operation on a BO/VM accessed by the CS
> >>>>>>>
> >>>>>>> to run concurrently. What it *doesn't* try to allow is
> >>>>>>>
> >>>>>>> (t+0) a VM operation on a BO/VM accessed by the CS
> >>>>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>>>
> >>>>>>> to run concurrently. When you write
> >>>>>>>
> >>>>>>>> Only when all this is done can we then resolve the dependency that the
> >>>>>>>> CS currently must wait for any clear operation on the VM.
> >>>>>>> isn't that all about the second problem?
> >>>>>> No, it's the same.
> >>>>>>
> >>>>>> See, what we do in the VM code is to artificially insert a bubble so that
> >>>>>> all VM clear operations wait for all CS operations, and then use the
> >>>>>> clear fence to indicate when the backing store of the BO can be freed.
> >>>>> Isn't that remedied with something like the code below? At least the
> >>>>> gem_close case should be handled with this, and the move case was
> >>>>> already handled by the copy operation.
> >>>> That is one necessary puzzle piece, yes. But you need more than that.
> >>>>
> >>>> In particular, the explicit unmap operation needs to be converted into an
> >>>> implicit unmap to get the TLB flush right.
> >>> This doesn't change anything about the TLB flush though, since all
> >>> unmap -> later-job dependencies are still implicit.
> >>>
> >>> So the worst that could happen (if, e.g., userspace gets the
> >>> waits/dependencies wrong) is
> >>>
> >>> 1) non-implicit CS gets submitted that touches a BO
> >>> 2) VM unmap on that BO happens
> >>> 2.5) the CS from 1 is still active due to missing dependencies
> >>> 2.6) but any CS submission after 2 will trigger a TLB flush
> >> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> > For this series, not really. To clarify, there are two sides to
> > getting GPU bubbles and no overlap:
> >
> > (1) VM operations implicitly wait for earlier CS submissions
> > (2) CS submissions implicitly wait for earlier VM operations
> >
> > Together, these combine to ensure that you get a (potentially small)
> > bubble any time VM work happens.
> >
> > Your series (and further ideas) tackles (2), and is a worthwhile thing
> > to do. However, while writing the userspace for this I noticed this
> > isn't enough to get rid of all our GPU bubbles. In particular, when
> > doing a non-sparse map of a new BO, that map tends to need to be waited
> > on for the next CS anyway for API semantics. Because VM operations
> > happen on a single timeline, this high-priority map can end up being
> > blocked by earlier sparse maps, so the bubble in that case still exists.
> >
> > So in this series I try to tackle (1) instead. Since GPU work
> > typically lags behind CPU submissions and VM operations aren't that
> > slow, we can typically execute VM operations early enough that any
> > implicit syncs from (2) are less of an issue, or no issue at all.
>
> Ok, once more, since you don't seem to understand what I want to say: it
> isn't possible to fix #1 before you have fixed #2.
>
> The VM unmap operation here is a barrier which divides the CS operations
> into a before and an after. This is intentional design.

Why is that barrier needed? There are two barriers I do understand and
think we can deal with:

1) The VM unmap is a barrier between a prior CS and a later memory free.
2) The TLB flush needs to happen between a VM unmap and a later CS.

But why do we need the VM unmap to be a strict barrier between a prior CS
and a later CS?

> To get rid of this barrier you must first fix the part where CS
> submissions wait for the VM operation to complete, i.e. the necessity of
> the barrier.
>
> I'm working on this for a couple of years now and I'm really running out
> of ideas for how to explain this restriction.
>
> Regards,
> Christian.
>
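As a minimal sketch of the gem_close handling referenced above (Christian's
point 5, and the kind of code Bas alludes to with "the code below"): on
handle close, attach the fence of the last page-table update for the mapping
to the BO's reservation object, so that freeing the backing store waits for
the unmap while later CS submissions are not implicitly synced to it. The
dma_resv calls below are the real (v5.19-era) API; struct my_vm and
my_vm_unmap_and_get_fence() are hypothetical stand-ins, not actual amdgpu
functions.

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/err.h>
#include <drm/drm_gem.h>

struct my_vm;   /* hypothetical per-process VM, standing in for the driver VM */

/* Hypothetical helper: queue the PTE clear for obj's mapping in vm and
 * return the fence of that page-table update. */
struct dma_fence *my_vm_unmap_and_get_fence(struct my_vm *vm,
                                            struct drm_gem_object *obj);

/* Called on GEM handle close, with obj->resv locked by the caller. */
static int example_gem_close_unmap(struct drm_gem_object *obj, struct my_vm *vm)
{
        struct dma_fence *fence;
        int r;

        /* Make sure a fence slot is available before committing the unmap. */
        r = dma_resv_reserve_fences(obj->resv, 1);
        if (r)
                return r;

        fence = my_vm_unmap_and_get_fence(vm, obj);
        if (IS_ERR(fence))
                return PTR_ERR(fence);

        /*
         * BOOKKEEP usage: later CS submissions do not implicitly sync to
         * this fence, but freeing or moving the backing store (which waits
         * for all fences on the reservation object) still waits for the
         * unmap to finish.
         */
        dma_resv_add_fence(obj->resv, fence, DMA_RESV_USAGE_BOOKKEEP);
        dma_fence_put(fence);

        return 0;
}

With something along these lines, the CS itself no longer has to wait for
the clear operation; only the eventual free of the BO's backing store does,
which is the direction point 5 above describes.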