Hi Christian,

Friendly ping on the comments here. Are you okay with me re-spinning the series with a fixed patch 1, or do we need further discussion on the approach here?

Thanks,
Bas

On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Jun 6, 2022 at 12:35 PM Christian König <christian.koenig@xxxxxxx> wrote:
> >
> > On 06.06.22 at 12:30 Bas Nieuwenhuizen wrote:
> > > On Mon, Jun 6, 2022 at 12:15 PM Christian König <christian.koenig@xxxxxxx> wrote:
> > >>
> > >> On 03.06.22 at 21:11 Bas Nieuwenhuizen wrote:
> > >>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@xxxxxxx> wrote:
> > >>>> On 03.06.22 at 19:50 Bas Nieuwenhuizen wrote:
> > >>>>> [SNIP]
> > >>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> > >>>>>>> For this series, not really. To clarify, there are two sides to getting GPU bubbles and no overlap:
> > >>>>>>>
> > >>>>>>> (1) VM operations implicitly wait for earlier CS submissions
> > >>>>>>> (2) CS submissions implicitly wait for earlier VM operations
> > >>>>>>>
> > >>>>>>> Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
> > >>>>>>>
> > >>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular, when doing a non-sparse map of a new BO, that map tends to need to be waited on by the next CS anyway for API semantics. Because VM operations happen on a single timeline, this high-priority map can end up being blocked by earlier sparse maps, and hence the bubble still exists in that case.
> > >>>>>>>
> > >>>>>>> So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less of an issue, or no issue at all.
> > >>>>>> Ok, once more, since you don't seem to understand what I want to say: it isn't possible to fix #1 before you have fixed #2.
> > >>>>>>
> > >>>>>> The VM unmap operation here is a barrier which divides the CS operations into a before and an after. This is intentional design.
> > >>>>> Why is that barrier needed? The two barriers I got and understood, and which I think we can deal with, are:
> > >>>>>
> > >>>>> 1) The VM unmap is a barrier between prior CS and later memory free.
> > >>>>> 2) The TLB flush needs to happen between a VM unmap and later CS.
> > >>>>>
> > >>>>> But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
> > >>>> Exactly because of the two reasons you mentioned.
> > >>> This is the part I'm not seeing. I get that removing #2 is a nightmare, which is why I did something that doesn't violate that constraint.
> > >>>
> > >>> For example, if an explicit CS that was running before the VM operation keeps running until after the VM operation (and hence possibly until after the TLB flush, or otherwise doesn't have the TLB flush apply to it due to lack of async TLB flush support), that is not an issue. It might see the state from before the unmap, or after the unmap, or some intermediate state, and all of those would be okay.
> > >>>
> > >>> We still get the constraint that the TLB flush happens between the VM unmap and later CS, and hence the unmap is certainly visible to them.
> > >> So you basically just want to set the sync mode in amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> > > Yes, with the caveat that I want to do that only for DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit sync we get the old implicit behavior, and if we submit a CS with explicit sync we get the new explicit behavior). The rest of the series is basically just for enabling explicit sync submissions.
> >
> > That part won't work at all and would cause additional synchronization problems.
> >
> > First of all, for implicitly synced CS we should use READ, not BOOKKEEP, because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've fixed it so that this no longer causes memory corruption, but it is still nice to avoid.
>
> Yes, what I'm saying is that on implicit sync a CS submission should add READ fences to the dma_resv, and on explicit sync a CS submission should add BOOKKEEP fences.
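A minimal sketch of the fence bookkeeping described just above, assuming the dma_resv usage levels from the dma_resv rework; submit_add_fence() and the explicit_sync flag are illustrative names, not the actual amdgpu CS code:

    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>

    /*
     * Sketch: publish a CS fence on a shared BO's reservation object.
     * Implicitly synced submissions use DMA_RESV_USAGE_READ so importers
     * (e.g. OpenGL) will wait on them; explicitly synced submissions use
     * DMA_RESV_USAGE_BOOKKEEP so they are ignored by implicit sync.
     * The caller is assumed to hold the reservation lock of @bo_resv.
     */
    static int submit_add_fence(struct dma_resv *bo_resv,
                                struct dma_fence *fence,
                                bool explicit_sync)
    {
            enum dma_resv_usage usage = explicit_sync ?
                    DMA_RESV_USAGE_BOOKKEEP : DMA_RESV_USAGE_READ;
            int r;

            /* Make room for one more fence slot before adding. */
            r = dma_resv_reserve_fences(bo_resv, 1);
            if (r)
                    return r;

            dma_resv_add_fence(bo_resv, fence, usage);
            return 0;
    }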
>
> >
> > BOOKKEEP can only be used by VM updates themselves, so that they don't interfere with CS.
>
> That is exactly why we would go with BOOKKEEP for explicit sync CS submissions, no? An explicit submission shouldn't interfere with any other CS submissions. That includes being totally ignored by GL importers (if we want synchronization there between an explicit submission and GL, userspace is expected to use Jason's dmabuf fence import/export IOCTLs).
>
> >
> > Then the second problem is that the VM IOCTL has absolutely no idea what the CS IOCTL would be doing. That's why we have added the EXPLICIT sync flag on the BO.
>
> It doesn't need to? We just use a different sync_mode for BOOKKEEP fences vs. others:
> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
>
> (The nice thing about doing it this way is that it is independent of the IOCTL, i.e. it also works for the delayed mapping changes we trigger on CS submit.)
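The per-fence decision behind the patchwork link above could look roughly like this; it is only a sketch of the idea (the real change lives in the amdgpu sync code), and vm_update_sync_resv() plus the inline wait are illustrative, not the actual patch:

    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>

    /*
     * Sketch: collect the fences a VM (un)map has to wait for.
     * BOOKKEEP fences come from explicitly synced CS submissions and are
     * skipped, so they never stall the page-table update; everything else
     * (KERNEL/WRITE/READ) is still treated as a dependency.
     * The caller is assumed to hold the reservation lock of @resv.
     */
    static long vm_update_sync_resv(struct dma_resv *resv)
    {
            struct dma_resv_iter cursor;
            struct dma_fence *fence;
            long r = 0;

            dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_BOOKKEEP,
                                    fence) {
                    if (dma_resv_iter_usage(&cursor) ==
                        DMA_RESV_USAGE_BOOKKEEP)
                            continue;

                    /* The real code would add this to an amdgpu_sync object;
                     * waiting inline keeps the sketch self-contained. */
                    r = dma_fence_wait(fence, true);
                    if (r)
                            break;
            }
            return r;
    }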
> >
> > Regards,
> > Christian.
> >
> > >> That should be doable, but then you don't need any of the other changes.
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>>> #1 is rather easy to fix: you just need to copy all dma_fences from the page table dma_resv object over to the BO's dma_resv object in the GEM close handler, e.g. exactly what you suggested with the dma_resv_copy function.
> > >>>>
> > >>>> #2 is a nightmare.
> > >>>>
> > >>>> We can't move the TLB flush to the end of the unmap operation because async TLB flushes are either a bit complicated (double flushes etc.) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing, and that means a CS->VM->CS barrier.
> > >>>>
> > >>>> We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
> > >>>>
> > >>>> But for the unmap operation we would need some kind of special dma_fence implementation which would not only wait for all existing dma_fences but also for the ones added until the unmap operation is completed, because otherwise the operation we do for #1 would simply not catch all dma_fences which have access to the memory.
> > >>>>
> > >>>> That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
> > >>>>
> > >>>> When we can grab the locks for all the BOs involved, amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
> > >>>>
> > >>>> With that done you can do the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
> > >>>>
> > >>>> Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
> > >>>>
> > >>>> Regards,
> > >>>> Christian.
> > >>>>
> > >>>>>> To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, i.e. the necessity of the barrier.
> > >>>>>>
> > >>>>>> I've been working on this for a couple of years now and I'm really running out of ideas for how to explain this restriction.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Christian.
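Christian's suggested fix for #1 quoted above (copying the page-table fences over to the BO in the GEM close handler) could be sketched as follows; the helper name and the assumption that both reservation objects are already locked are illustrative, not the actual amdgpu_gem_object_close() code:

    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>

    /*
     * Sketch: mirror every fence from the VM page-table reservation object
     * into the BO's reservation object, so that freeing the BO's backing
     * memory waits for all pending VM updates that still reference it.
     * Both reservation objects are assumed to be locked by the caller.
     */
    static int copy_vm_fences_to_bo(struct dma_resv *vm_resv,
                                    struct dma_resv *bo_resv)
    {
            struct dma_resv_iter cursor;
            struct dma_fence *fence;
            int r;

            dma_resv_for_each_fence(&cursor, vm_resv,
                                    DMA_RESV_USAGE_BOOKKEEP, fence) {
                    r = dma_resv_reserve_fences(bo_resv, 1);
                    if (r)
                            return r;

                    /* Keep the source usage so the copied fences keep their
                     * implicit/explicit sync semantics on the BO. */
                    dma_resv_add_fence(bo_resv, fence,
                                       dma_resv_iter_usage(&cursor));
            }
            return 0;
    }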