Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

Daniel Vetter <daniel@xxxxxxxx> · Wed, 26 May 2021 15:51:45 +0200

On Wed, May 26, 2021 at 3:32 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Am 25.05.21 um 17:23 schrieb Daniel Vetter:
> > On Tue, May 25, 2021 at 5:05 PM Christian König
> > <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >> Hi Daniel,
> >>
> >> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> >>> Hi Christian,
> >>>
> >>> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >>>> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> >>>> This works by adding the fence of the last eviction DMA operation to BOs
> >>>> when their backing store is newly allocated. That's what the
> >>>> ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>>>
> >>>> Now the problem is it is possible that the application is terminated before
> >>>> it can complete it's command submission. But since resource management only
> >>>> waits for the shared fences when there are some there is a chance that we
> >>>> free up memory while it is still in use.
> >>> Hm where is this code? Would help in my audit that I wanted to do this
> >>> week? If you look at most other places like
> >>> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> >>> treat the shared fences special and always also include the exclusive one.
> >> See amdgpu_gem_object_close():
> >>
> >> ...
> >>           fence = dma_resv_get_excl(bo->tbo.base.resv);
> >>           if (fence) {
> >>                   amdgpu_bo_fence(bo, fence, true);
> >>                   fence = NULL;
> >>           }
> >> ...
> >>
> >> We explicitly added that because resource management of some other
> >> driver was going totally bananas without that.
> >>
> >> But I'm not sure which one that was. Maybe dig a bit in the git and
> >> mailing history of that.
> > Hm I looked and it's
> >
> > commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> > Author: Christian König <christian.koenig@xxxxxxx>
> > Date:   Thu Mar 12 12:03:34 2020 +0100
> >
> >     drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
> >
> > That sounded more like amdgpu itself needing this, not another driver?
>
> No, that patch was just a follow up moving the functionality around.

That patch added the "add exclusive fence to shared slots before
amdgpu_vm_clear_freed() call", which I thought was at least part of
your fix.

> > But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> > right fencing mode for gpu pte clearing, so I'm really not sure what
> > the bug was that you worked around here?The implementation boils down
> > to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
> > there's nothing else that I could find in public history at least, no
> > references to bug reports or anything. I think you need to dig
> > internally, because as-is I'm not seeing the problem here.
> >
> > Or am I missing something here?
>
> See the code here for example:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361
>
> Nouveau assumes that when a shared fence is present it doesn't need to
> wait for the exclusive one because the shared are always supposed to
> finish after the exclusive one.
>
> But for page table unmap fences that isn't true and we ran into a really
> nasty and hard to reproduce bug because of this.
>
> I think it would be much more defensive if we could say that we always
> wait for the exclusive fence and fix the use case in nouveau and double
> check if somebody else does stuff like that as well.

Yeah most other drivers do the defensive thing here. I think it would
be good to standardize on that. I'll add that to my list and do more
auditing.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch