Re: [PATCH] drm/i915: Stop propagating fence errors by default

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Mon, 17 May 2021 16:33:45 +0100

On 17/05/2021 16:12, Daniel Vetter wrote:
On Tue, May 11, 2021 at 10:05:27AM +0100, Tvrtko Ursulin wrote:

On 10/05/2021 16:55, Daniel Vetter wrote:
On Fri, May 07, 2021 at 09:35:21AM +0100, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

This is an alternative proposed fix for the below references bug report
where dma fence error propagation is causing undesirable change in
behaviour post GPU hang/reset.

Approach in this patch is to simply stop propagating all dma fence errors
by default since that seems to be the upstream ask.

To handle the case where i915 needs error propagation for security, I add
a new dma fence flag DMA_FENCE_FLAG_PROPAGATE_ERROR and make use of it in
the command parsing chain only.

It sounds a plausible argument that fence propagation could be useful in
which case a core flag to enable opt-in should be universally useful.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Reported-by: Marcin Slusarz <marcin.slusarz@xxxxxxxxx>
Reported-by: Miroslav Bendik
References: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")
References: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
Cc: Jason Ekstrand <jason.ekstrand@xxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
---
   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 ++
   drivers/gpu/drm/i915/i915_sw_fence.c           | 8 ++++----
   drivers/gpu/drm/i915/i915_sw_fence.h           | 8 ++++++++
   include/linux/dma-fence.h                      | 1 +

I still don't like this, least because we still introduce the concept of
error propagation to dma-fence (but hey only in i915 code, which is
exactly the kind of not-really-upstream approach we got a major chiding
for).

The only thing this does is make it explicitly opt-in instead opt-out,
like the first fix. The right approach is imo still to just throw it out,
and instead make the one error propagation we really need very, very
explicit. Instead of hiding it behind lots of magic.

The one error propagation we need is when the cmd parser work fails, it
must cancel it's corresponding request to make sure the batchbuffer
doesn't run. This should require about 2 lines in total:

- one line to store the request so that the cmd parser work can access it.
    No refcounting needed, because the the request cannot even start (much
    less get freed) before the cmd parser has singalled its fence

- one line to kill the request if the parsing fails. Maybe 2 if you
    include the if condition. I have no idea how that's done since I'm
    honestly lost how the i915 scheduler decides whether to run a batch or
    not. I'm guessing we have a version of this for the ringbuffer and the
    execlist backend (if not maybe gen7 cmdparser is broken?)

I don't see any need for magic behind-the-scenes propagation of such a
security critical error. Especially when that error propagation thing
caused security bugs of its own, is an i915-only feature, and not
motivated by any userspace/uapi requirements at all.

I took this approach because to me propagating errors sounds more logical
than ignoring them and I was arguing in the commit message that the
infrastructure to enable that could be put in place as opt-in.

I also do not see a lot of magic in this patch. Only thing, potentially the
logic should be inverted so that the waiter marks itself as interested in
receiving errors. That would probably make even more sense as a core
concept.

Has there been a wider discussion on this topic in the past? I am curious to
know, even if propagation currently is i915 only, could other drivers be
interested.

There hasn't been. i915-gem team decided "this is a cool concept", which
resulted in a security bug. Now we're a few months in arguing whether a
cool-looking concept that leads to a security bug is maybe a good idea,
and whether we should sneak it in as a core concept to dma-buf.h without
any wider discussion on the concept.

Note that it adds almost nothing to the dma-buf common code about a single
flag, and at some point (currently missing) documentation on the very flag.

This is really not how upstream collaboration works, and it needs to stop.

If you want this, start another thread arguing why this is a good idea,
fully decoupled from the security fix here.

When I asked you whether you know there were past discussions on this 
topic, clearly the point of that was to figure out whether a new 
discussion needs to be started, or I need to go and read an existing one 
to get up to speed.

I don't know how you interpreted that as an attempt to sneak anything 
in. And I don't know how I could have reliably figured out the answer to 
that question without asking. So colour me confused.

To clarify on the security issue part - are you talking about 
https://gitlab.freedesktop.org/drm/intel/-/issues/3080, or the other 
security issue, the one which would be caused by simply reverting the 
error propagation in i915?

Regards,

Tvrtko