On Wed, May 26, 2021 at 3:32 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Am 25.05.21 um 17:23 schrieb Daniel Vetter:
> > On Tue, May 25, 2021 at 5:05 PM Christian König
> > <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >> Hi Daniel,
> >>
> >> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> >>> Hi Christian,
> >>>
> >>> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >>>> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> >>>> This works by adding the fence of the last eviction DMA operation to BOs
> >>>> when their backing store is newly allocated. That's what the
> >>>> ttm_bo_add_move_fence() function you stumbled over is good for:
> >>>> https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>>>
> >>>> Now the problem is it is possible that the application is terminated before
> >>>> it can complete its command submission. But since resource management only
> >>>> waits for the shared fences when there are some, there is a chance that we
> >>>> free up memory while it is still in use.
> >>> Hm where is this code? Would help in my audit that I wanted to do this
> >>> week. If you look at most other places like
> >>> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> >>> treat the shared fences specially and always also include the exclusive one.
> >> See amdgpu_gem_object_close():
> >>
> >> ...
> >>         fence = dma_resv_get_excl(bo->tbo.base.resv);
> >>         if (fence) {
> >>                 amdgpu_bo_fence(bo, fence, true);
> >>                 fence = NULL;
> >>         }
> >> ...
> >>
> >> We explicitly added that because resource management of some other
> >> driver was going totally bananas without that.
> >>
> >> But I'm not sure which one that was. Maybe dig a bit in the git and
> >> mailing list history for that.
> > Hm I looked and it's
> >
> > commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> > Author: Christian König <christian.koenig@xxxxxxx>
> > Date:   Thu Mar 12 12:03:34 2020 +0100
> >
> >     drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
> >
> > That sounded more like amdgpu itself needing this, not another driver?
>
> No, that patch was just a follow up moving the functionality around.

That patch added the "add exclusive fence to shared slots before
amdgpu_vm_clear_freed() call", which I thought was at least part of
your fix.

> > But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> > right fencing mode for gpu pte clearing, so I'm really not sure what
> > the bug was that you worked around here? The implementation boils down
> > to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
> > there's nothing else that I could find in public history at least, no
> > references to bug reports or anything. I think you need to dig
> > internally, because as-is I'm not seeing the problem here.
> >
> > Or am I missing something here?
>
> See the code here for example:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361
>
> Nouveau assumes that when a shared fence is present it doesn't need to
> wait for the exclusive one, because the shared fences are always supposed
> to finish after the exclusive one.
>
> But for page table unmap fences that isn't true, and we ran into a really
> nasty and hard to reproduce bug because of this.
>
> I think it would be much more defensive if we could say that we always
> wait for the exclusive fence, fix the use case in nouveau, and double
> check whether somebody else does stuff like that as well.

Yeah most other drivers do the defensive thing here.
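Roughly, the defensive pattern is to snapshot and wait on both the
exclusive fence and all shared fences, instead of skipping the exclusive
one whenever shared fences exist. A minimal sketch of that (the function
name is made up and this isn't taken from any in-tree driver; it assumes
the dma_resv helpers as of roughly v5.12, i.e. dma_resv_get_fences_rcu()
and dma_fence_wait()):

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/slab.h>

/*
 * Illustrative sketch only: always wait on the exclusive fence as well,
 * rather than assuming that the presence of shared fences implies the
 * exclusive fence has already signaled (the assumption nouveau makes).
 */
static int example_wait_resv_defensive(struct dma_resv *resv, bool intr)
{
	struct dma_fence *excl;
	struct dma_fence **shared;
	unsigned int shared_count, i;
	long ret;

	/* Snapshot the exclusive fence and all shared fences. */
	ret = dma_resv_get_fences_rcu(resv, &excl, &shared_count, &shared);
	if (ret)
		return ret;

	/* Wait for the exclusive fence even if shared fences are present. */
	if (excl) {
		ret = dma_fence_wait(excl, intr);
		dma_fence_put(excl);
		if (ret < 0)
			goto out_put_shared;
	}

	/* Then wait for the shared fences as usual. */
	for (i = 0; i < shared_count; i++) {
		ret = dma_fence_wait(shared[i], intr);
		if (ret < 0)
			break;
	}

out_put_shared:
	for (i = 0; i < shared_count; i++)
		dma_fence_put(shared[i]);
	kfree(shared);
	return ret < 0 ? ret : 0;
}

Compared to the nouveau path linked above, the only real difference is
the unconditional wait on the exclusive fence in addition to the shared
ones.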
I think it would be good to standardize on that. I'll add that to my
list and do more auditing.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch