On Wed, 2022-04-20 at 20:56 +0200, Christian König wrote:
> ⚠ External Email
>
> Am 20.04.22 um 20:49 schrieb Christian König:
> > Am 20.04.22 um 20:41 schrieb Zack Rusin:
> > > On Wed, 2022-04-20 at 19:40 +0200, Christian König wrote:
> > > > Am 20.04.22 um 19:38 schrieb Zack Rusin:
> > > > > On Wed, 2022-04-20 at 09:37 +0200, Christian König wrote:
> > > > > > ⚠ External Email
> > > > > >
> > > > > > Hi Zack,
> > > > > >
> > > > > > Am 20.04.22 um 05:56 schrieb Zack Rusin:
> > > > > > > On Thu, 2022-04-07 at 10:59 +0200, Christian König wrote:
> > > > > > > > Rework the internals of the dma_resv object to allow
> > > > > > > > adding more than one write fence and remember for each
> > > > > > > > fence what purpose it had.
> > > > > > > >
> > > > > > > > This allows removing the workaround from amdgpu which
> > > > > > > > used a container for this instead.
> > > > > > > >
> > > > > > > > Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> > > > > > > > Reviewed-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
> > > > > > > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > > > > afaict this change broke vmwgfx, which now kernel oopses
> > > > > > > right after boot. I haven't had the time to look into it
> > > > > > > yet, so I'm not sure what the problem is. I'll look at
> > > > > > > this tomorrow, but just in case you have some clues, the
> > > > > > > backtrace follows:
> > > > > > that's a known issue and should already be fixed with:
> > > > > >
> > > > > > commit d72dcbe9fce505228dae43bef9da8f2b707d1b3d
> > > > > > Author: Christian König <christian.koenig@xxxxxxx>
> > > > > > Date:   Mon Apr 11 15:21:59 2022 +0200
> > > > > Unfortunately that doesn't seem to be it. The backtrace is
> > > > > from the current (as of the time of sending of this email)
> > > > > drm-misc-next, which has this change, so it's something else.
> > > > Ok, that's strange. In this case I need to investigate further.
> > > >
> > > > Maybe VMWGFX is adding more than one fence and we actually need
> > > > to reserve multiple slots.
> > > This might be a helper code issue with CONFIG_DEBUG_MUTEXES set.
> > > With that config dma_resv_reset_max_fences does:
> > > fences->max_fences = fences->num_fences;
> > > For some objects num_fences is 0, so afterwards max_fences and
> > > num_fences are both 0, and then BUG_ON(num_fences >= max_fences)
> > > is triggered.
> >
> > Yeah, but that's expected behavior.
> >
> > What's not expected is that max_fences is still 0 (or equal to the
> > old num_fences) when VMWGFX tries to add a new fence. The function
> > ttm_eu_reserve_buffers() should have reserved at least one fence
> > slot.
> >
> > So the underlying problem is that either ttm_eu_reserve_buffers()
> > was never called or VMWGFX tried to add more than one fence.
> >
> To figure out what it is could you try the following code fragment:
>
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> index f46891012be3..a36f89d3f36d 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> @@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
>                 val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
>                 if (!val_buf->bo)
>                         return -ESRCH;
> -               val_buf->num_shared = 0;
> +               val_buf->num_shared = 16;
>                 list_add_tail(&val_buf->head, &ctx->bo_list);
>                 bo_node->as_mob = as_mob;
>                 bo_node->cpu_blit = cpu_blit;

Fails the same BUG_ON with num_fences and max_fences == 0.

z
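A minimal sketch of the reserve-before-add contract discussed above,
assuming a driver-side helper: my_driver_attach_fence and the
single-slot reservation are hypothetical and not from the thread,
while dma_resv_lock(), dma_resv_reserve_fences(), dma_resv_add_fence()
and DMA_RESV_USAGE_WRITE are the post-rework dma_resv entry points.

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/*
 * Hypothetical helper: attach one fence to a buffer's reservation
 * object.  dma_resv_add_fence() requires a previously reserved slot;
 * with CONFIG_DEBUG_MUTEXES, dma_resv_reset_max_fences() sets
 * max_fences back to num_fences (as noted above), so a missing
 * reservation trips BUG_ON(num_fences >= max_fences).
 */
static int my_driver_attach_fence(struct dma_resv *resv,
                                  struct dma_fence *fence)
{
        int ret;

        ret = dma_resv_lock(resv, NULL);
        if (ret)
                return ret;

        /* Reserve a slot for each fence we intend to add. */
        ret = dma_resv_reserve_fences(resv, 1);
        if (!ret)
                dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);

        dma_resv_unlock(resv);
        return ret;
}

In the TTM execbuf path that reservation is normally done for the
whole validation list by ttm_eu_reserve_buffers(), which is what the
quoted mail expects to have happened before any fence is added.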