Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.

Christian König <christian.koenig@xxxxxxx> · Fri, 3 Jun 2022 20:41:03 +0200

Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP]
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really.  To clarify there are two sides for
getting GPU bubbles and no overlap:

(1) VM operations implicitly wait for earlier CS submissions
(2) CS submissions implicitly wait for earlier VM operations

Together, these combine to ensure that you get a (potentially small)
bubble any time VM work happens.

Your series (and further ideas) tackles (2), and is a worthwhile thing
to do. However, while writing the userspace for this I noticed this
isn't enough to get rid of all our GPU bubbles. In particular when
doing a non-sparse map of a new BO, that tends to need to be waited on
for the next CS anyway for API semantics. Due to VM operations
happening on a single timeline that means this high priority map can
end up being blocked by earlier sparse maps and hence the bubble in
that case still exists.

So in this series I try to tackle (1) instead. Since GPU work
typically lags behind CPU submissions and VM operations aren't that
slow, we can typically execute VM operations early enough that any
implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It
isn't possible to fix #1 before you have fixed #2.

The VM unmap operation here is a barrier which divides the CS operations
in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and
I think we can deal with:

1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush need to happen between a VM unmap and later CS.

But why do we need the VM unmap to be a strict barrier between prior
CS and later CS?

Exactly because of the two reasons you mentioned.

#1 Is rather easy to fix, you just need to copy all dma_fences from the 
page table dma_resv object over to the BOs dma_resv object in the gem 
close handler. E.g. exactly what you suggested with the dma_resv_copy 
function.

#2 is a nightmare.

We can't move the TLB flush at the end of the unmap operation because on 
async TLB flushes are either a bit complicated (double flushes etc..) or 
don't even work at all because of hw bugs. So to have a reliable TLB 
flush we must make sure that nothing else is ongoing and that means 
CS->VM->CS barrier.

We try very hard to circumvent that already on maps by (for example) 
using a completely new VMID for CS after the VM map operation.

But for the unmap operation we would need some kind special dma_fence 
implementation which would not only wait for all existing dma_fence but 
also for the one added until the unmap operation is completed. Cause 
otherwise our operation we do at #1 would simply not catch all 
dma_fences which have access to the memory.

That's certainly doable, but I think just using the drm_exec stuff I 
already came up with is easier.

When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() 
goes away and we can keep track of the unmap operations in the bo_va 
structure.

With that done you can make the explicit sync you noted in the bo_va 
structure and implicit sync when the bo_va structure goes away.

Then the only reason I can see why we would need a CS->VM dependency is 
implicit synchronization, and that's what we are trying to avoid here in 
the first place.

Regards,
Christian.

To get rid of this barrier you must first fix the part where CS
submissions wait for the VM operation to complete, e.g. the necessity of
the barrier.

I'm working on this for a couple of years now and I'm really running out
of idea how to explain this restriction.

Regards,
Christian.