On Mon, Jun 6, 2022 at 12:15 PM Christian König
<christian.koenig@xxxxxxx> wrote:
>
>
>
> On 03.06.22 at 21:11, Bas Nieuwenhuizen wrote:
> > On Fri, Jun 3, 2022 at 8:41 PM Christian König
> > <christian.koenig@xxxxxxx> wrote:
> >> On 03.06.22 at 19:50, Bas Nieuwenhuizen wrote:
> >>> [SNIP]
> >>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>>>> For this series, not really. To clarify, there are two sides to
> >>>>> getting GPU bubbles and no overlap:
> >>>>>
> >>>>> (1) VM operations implicitly wait for earlier CS submissions
> >>>>> (2) CS submissions implicitly wait for earlier VM operations
> >>>>>
> >>>>> Together, these combine to ensure that you get a (potentially small)
> >>>>> bubble any time VM work happens.
> >>>>>
> >>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>>>> to do. However, while writing the userspace for this I noticed this
> >>>>> isn't enough to get rid of all our GPU bubbles. In particular, when
> >>>>> doing a non-sparse map of a new BO, that map tends to need to be
> >>>>> waited on for the next CS anyway for API semantics. Because VM
> >>>>> operations happen on a single timeline, this high-priority map can
> >>>>> end up being blocked by earlier sparse maps, and hence the bubble in
> >>>>> that case still exists.
> >>>>>
> >>>>> So in this series I try to tackle (1) instead. Since GPU work
> >>>>> typically lags behind CPU submissions and VM operations aren't that
> >>>>> slow, we can typically execute VM operations early enough that any
> >>>>> implicit syncs from (2) are less of an issue, or none at all.
> >>>> Ok, once more, since you don't seem to understand what I want to say:
> >>>> it isn't possible to fix #1 before you have fixed #2.
> >>>>
> >>>> The VM unmap operation here is a barrier which divides the CS
> >>>> operations into a before and an after. This is intentional design.
> >>> Why is that barrier needed? The two barriers I got and understood, and
> >>> that I think we can deal with, are:
> >>>
> >>> 1) The VM unmap is a barrier between a prior CS and a later memory free.
> >>> 2) The TLB flush needs to happen between a VM unmap and a later CS.
> >>>
> >>> But why do we need the VM unmap to be a strict barrier between a prior
> >>> CS and a later CS?
> >> Exactly because of the two reasons you mentioned.
> > This is the part I'm not seeing. I get that removing #2 is a
> > nightmare, which is why I did something that doesn't violate that
> > constraint.
> >
> > Like, if an explicit CS that was running before the VM operation runs
> > until after the VM operation (and hence possibly until after the TLB
> > flush, or otherwise has the TLB flush not apply due to lack of async
> > TLB flush support), that is not an issue. It might see the state from
> > before the unmap, or after the unmap, or some intermediate state, and
> > all of those would be okay.
> >
> > We still get the constraint that the TLB flush happens between the VM
> > unmap and a later CS, and hence the unmap is certainly visible to them.
>
> So you basically just want to set the sync mode in
> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

Yes, with the caveat that I want to do that only for
DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit
sync we get the old implicit behavior, and if we submit a CS with
explicit sync we get the new explicit behavior). The rest of the series
is basically just about enabling explicit sync submissions.
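
To make that concrete, here is a rough sketch of the fence selection
this boils down to for an unmap. It is purely illustrative and not the
actual patch: the function name is made up, and a plain blocking wait
stands in for adding the fences as dependencies of the VM update job.

#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only, not the actual patch: have a VM unmap wait
 * only for implicitly synced work.  Fences with KERNEL/WRITE/READ usage
 * are still waited for (the old behavior), while DMA_RESV_USAGE_BOOKKEEP
 * fences from explicitly synced CS submissions are skipped, so the unmap
 * no longer acts as a barrier for them.
 */
static long vm_unmap_sync_implicit_only(struct dma_resv *resv)
{
        /* DMA_RESV_USAGE_READ also includes KERNEL and WRITE fences,
         * but not BOOKKEEP ones, which is exactly the split described
         * above. */
        return dma_resv_wait_timeout(resv, DMA_RESV_USAGE_READ, true,
                                     MAX_SCHEDULE_TIMEOUT);
}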

> That should be doable, but then you don't need any of the other changes.
>
> Regards,
> Christian.
>
> >
> >> #1 is rather easy to fix: you just need to copy all dma_fences from the
> >> page table dma_resv object over to the BO's dma_resv object in the gem
> >> close handler, e.g. exactly what you suggested with the dma_resv_copy
> >> function.
> >>
> >> #2 is a nightmare.
> >>
> >> We can't move the TLB flush to the end of the unmap operation because
> >> async TLB flushes are either a bit complicated (double flushes, etc.) or
> >> don't even work at all because of hw bugs. So to have a reliable TLB
> >> flush we must make sure that nothing else is ongoing, and that means a
> >> CS->VM->CS barrier.
> >>
> >> We already try very hard to circumvent that on maps by (for example)
> >> using a completely new VMID for CS after the VM map operation.
> >>
> >> But for the unmap operation we would need some kind of special dma_fence
> >> implementation which would not only wait for all existing dma_fences but
> >> also for the ones added until the unmap operation is completed, because
> >> otherwise the operation we do at #1 would simply not catch all
> >> dma_fences which have access to the memory.
> >>
> >> That's certainly doable, but I think just using the drm_exec stuff I
> >> already came up with is easier.
> >>
> >> When we can grab locks for all the BOs involved, amdgpu_vm_clear_freed()
> >> goes away and we can keep track of the unmap operations in the bo_va
> >> structure.
> >>
> >> With that done you can do the explicit sync you noted in the bo_va
> >> structure, and implicit sync when the bo_va structure goes away.
> >>
> >> Then the only reason I can see why we would need a CS->VM dependency is
> >> implicit synchronization, and that's what we are trying to avoid here in
> >> the first place.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>> To get rid of this barrier you must first fix the part where CS
> >>>> submissions wait for the VM operation to complete, i.e. the necessity
> >>>> of the barrier.
> >>>>
> >>>> I've been working on this for a couple of years now and I'm really
> >>>> running out of ideas for how to explain this restriction.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
>
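
For reference, the #1 fix quoted above, copying the page-table fences
over to the BO at gem close time, could look roughly like the
following. This is only a sketch under my own assumptions, not the
actual amdgpu handler: mirror_vm_fences_on_close() and the vm_resv
parameter are made-up names standing in for the real gem close path
and the per-VM page-table reservation object, assuming
dma_resv_copy_fences() is the dma_resv_copy helper meant here.

#include <linux/dma-resv.h>
#include <drm/drm_gem.h>

/*
 * Sketch only, not the actual amdgpu gem close handler: mirror the
 * fences of the per-VM page-table reservation object into the BO's own
 * reservation object, so that freeing the memory later still waits for
 * the pending unmap.
 */
static int mirror_vm_fences_on_close(struct drm_gem_object *obj,
                                     struct dma_resv *vm_resv)
{
        int r;

        r = dma_resv_lock(obj->resv, NULL);
        if (r)
                return r;

        /* Copy all fences, including DMA_RESV_USAGE_BOOKKEEP ones. */
        r = dma_resv_copy_fences(obj->resv, vm_resv);

        dma_resv_unlock(obj->resv);
        return r;
}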