Re: [PATCH] drm/i915: Stop propagating fence errors by default

Daniel Vetter <daniel@xxxxxxxx> · Mon, 10 May 2021 17:55:36 +0200

On Fri, May 07, 2021 at 09:35:21AM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> 
> This is an alternative proposed fix for the below references bug report
> where dma fence error propagation is causing undesirable change in
> behaviour post GPU hang/reset.
> 
> Approach in this patch is to simply stop propagating all dma fence errors
> by default since that seems to be the upstream ask.
> 
> To handle the case where i915 needs error propagation for security, I add
> a new dma fence flag DMA_FENCE_FLAG_PROPAGATE_ERROR and make use of it in
> the command parsing chain only.
> 
> It sounds a plausible argument that fence propagation could be useful in
> which case a core flag to enable opt-in should be universally useful.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Reported-by: Marcin Slusarz <marcin.slusarz@xxxxxxxxx>
> Reported-by: Miroslav Bendik
> References: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")
> References: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
> Cc: Jason Ekstrand <jason.ekstrand@xxxxxxxxx>
> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 ++
>  drivers/gpu/drm/i915/i915_sw_fence.c           | 8 ++++----
>  drivers/gpu/drm/i915/i915_sw_fence.h           | 8 ++++++++
>  include/linux/dma-fence.h                      | 1 +

I still don't like this, least because we still introduce the concept of
error propagation to dma-fence (but hey only in i915 code, which is
exactly the kind of not-really-upstream approach we got a major chiding
for).

The only thing this does is make it explicitly opt-in instead opt-out,
like the first fix. The right approach is imo still to just throw it out,
and instead make the one error propagation we really need very, very
explicit. Instead of hiding it behind lots of magic.

The one error propagation we need is when the cmd parser work fails, it
must cancel it's corresponding request to make sure the batchbuffer
doesn't run. This should require about 2 lines in total:

- one line to store the request so that the cmd parser work can access it.
  No refcounting needed, because the the request cannot even start (much
  less get freed) before the cmd parser has singalled its fence

- one line to kill the request if the parsing fails. Maybe 2 if you
  include the if condition. I have no idea how that's done since I'm
  honestly lost how the i915 scheduler decides whether to run a batch or
  not. I'm guessing we have a version of this for the ringbuffer and the
  execlist backend (if not maybe gen7 cmdparser is broken?)

I don't see any need for magic behind-the-scenes propagation of such a
security critical error. Especially when that error propagation thing
caused security bugs of its own, is an i915-only feature, and not
motivated by any userspace/uapi requirements at all.

Thanks, Daniel

>  4 files changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 297143511f99..6a516d1261d0 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -2522,6 +2522,8 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
>  	}
>  
>  	dma_fence_work_init(&pw->base, &eb_parse_ops);
> +	/* Propagate errors for security. */
> +	__set_bit(DMA_FENCE_FLAG_PROPAGATE_ERROR, &pw->base.dma.flags);
>  
>  	pw->engine = eb->engine;
>  	pw->batch = eb->batch->vma;
> diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
> index 2744558f3050..2ee917932ccf 100644
> --- a/drivers/gpu/drm/i915/i915_sw_fence.c
> +++ b/drivers/gpu/drm/i915/i915_sw_fence.c
> @@ -449,7 +449,7 @@ static void dma_i915_sw_fence_wake_timer(struct dma_fence *dma,
>  
>  	fence = xchg(&cb->base.fence, NULL);
>  	if (fence) {
> -		i915_sw_fence_set_error_once(fence, dma->error);
> +		i915_sw_fence_propagate_dma_fence_error(fence, dma);
>  		i915_sw_fence_complete(fence);
>  	}
>  
> @@ -480,7 +480,7 @@ int i915_sw_fence_await_dma_fence(struct i915_sw_fence *fence,
>  	might_sleep_if(gfpflags_allow_blocking(gfp));
>  
>  	if (dma_fence_is_signaled(dma)) {
> -		i915_sw_fence_set_error_once(fence, dma->error);
> +		i915_sw_fence_propagate_dma_fence_error(fence, dma);
>  		return 0;
>  	}
>  
> @@ -496,7 +496,7 @@ int i915_sw_fence_await_dma_fence(struct i915_sw_fence *fence,
>  		if (ret)
>  			return ret;
>  
> -		i915_sw_fence_set_error_once(fence, dma->error);
> +		i915_sw_fence_propagate_dma_fence_error(fence, dma);
>  		return 0;
>  	}
>  
> @@ -548,7 +548,7 @@ int __i915_sw_fence_await_dma_fence(struct i915_sw_fence *fence,
>  	debug_fence_assert(fence);
>  
>  	if (dma_fence_is_signaled(dma)) {
> -		i915_sw_fence_set_error_once(fence, dma->error);
> +		i915_sw_fence_propagate_dma_fence_error(fence, dma);
>  		return 0;
>  	}
>  
> diff --git a/drivers/gpu/drm/i915/i915_sw_fence.h b/drivers/gpu/drm/i915/i915_sw_fence.h
> index 30a863353ee6..872ef80ebd10 100644
> --- a/drivers/gpu/drm/i915/i915_sw_fence.h
> +++ b/drivers/gpu/drm/i915/i915_sw_fence.h
> @@ -116,4 +116,12 @@ i915_sw_fence_set_error_once(struct i915_sw_fence *fence, int error)
>  		cmpxchg(&fence->error, 0, error);
>  }
>  
> +static inline void
> +i915_sw_fence_propagate_dma_fence_error(struct i915_sw_fence *fence,
> +					struct dma_fence *dma)
> +{
> +	if (unlikely(test_bit(DMA_FENCE_FLAG_PROPAGATE_ERROR, &dma->flags)))
> +		i915_sw_fence_set_error_once(fence, dma->error);
> +}
> +
>  #endif /* _I915_SW_FENCE_H_ */
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index 6ffb4b2c6371..8dabe1650f11 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -99,6 +99,7 @@ enum dma_fence_flag_bits {
>  	DMA_FENCE_FLAG_SIGNALED_BIT,
>  	DMA_FENCE_FLAG_TIMESTAMP_BIT,
>  	DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
> +	DMA_FENCE_FLAG_PROPAGATE_ERROR,
>  	DMA_FENCE_FLAG_USER_BITS, /* must always be last member */
>  };
>  
> -- 
> 2.30.2
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch