Re: [PATCH] drm/i915: Stop propagating fence errors by default

Daniel Vetter <daniel@xxxxxxxx> · Mon, 17 May 2021 17:51:10 +0200

On Mon, May 17, 2021 at 5:33 PM Tvrtko Ursulin
<tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
> On 17/05/2021 16:12, Daniel Vetter wrote:
> > On Tue, May 11, 2021 at 10:05:27AM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 10/05/2021 16:55, Daniel Vetter wrote:
> >>> On Fri, May 07, 2021 at 09:35:21AM +0100, Tvrtko Ursulin wrote:
> >>>> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>>
> >>>> This is an alternative proposed fix for the below references bug report
> >>>> where dma fence error propagation is causing undesirable change in
> >>>> behaviour post GPU hang/reset.
> >>>>
> >>>> Approach in this patch is to simply stop propagating all dma fence errors
> >>>> by default since that seems to be the upstream ask.
> >>>>
> >>>> To handle the case where i915 needs error propagation for security, I add
> >>>> a new dma fence flag DMA_FENCE_FLAG_PROPAGATE_ERROR and make use of it in
> >>>> the command parsing chain only.
> >>>>
> >>>> It sounds a plausible argument that fence propagation could be useful in
> >>>> which case a core flag to enable opt-in should be universally useful.
> >>>>
> >>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>> Reported-by: Marcin Slusarz <marcin.slusarz@xxxxxxxxx>
> >>>> Reported-by: Miroslav Bendik
> >>>> References: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")
> >>>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
> >>>> Cc: Jason Ekstrand <jason.ekstrand@xxxxxxxxx>
> >>>> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> >>>> ---
> >>>>    drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 ++
> >>>>    drivers/gpu/drm/i915/i915_sw_fence.c           | 8 ++++----
> >>>>    drivers/gpu/drm/i915/i915_sw_fence.h           | 8 ++++++++
> >>>>    include/linux/dma-fence.h                      | 1 +
> >>>
> >>> I still don't like this, least because we still introduce the concept of
> >>> error propagation to dma-fence (but hey only in i915 code, which is
> >>> exactly the kind of not-really-upstream approach we got a major chiding
> >>> for).
> >>>
> >>> The only thing this does is make it explicitly opt-in instead opt-out,
> >>> like the first fix. The right approach is imo still to just throw it out,
> >>> and instead make the one error propagation we really need very, very
> >>> explicit. Instead of hiding it behind lots of magic.
> >>>
> >>> The one error propagation we need is when the cmd parser work fails, it
> >>> must cancel it's corresponding request to make sure the batchbuffer
> >>> doesn't run. This should require about 2 lines in total:
> >>>
> >>> - one line to store the request so that the cmd parser work can access it.
> >>>     No refcounting needed, because the the request cannot even start (much
> >>>     less get freed) before the cmd parser has singalled its fence
> >>>
> >>> - one line to kill the request if the parsing fails. Maybe 2 if you
> >>>     include the if condition. I have no idea how that's done since I'm
> >>>     honestly lost how the i915 scheduler decides whether to run a batch or
> >>>     not. I'm guessing we have a version of this for the ringbuffer and the
> >>>     execlist backend (if not maybe gen7 cmdparser is broken?)
> >>>
> >>> I don't see any need for magic behind-the-scenes propagation of such a
> >>> security critical error. Especially when that error propagation thing
> >>> caused security bugs of its own, is an i915-only feature, and not
> >>> motivated by any userspace/uapi requirements at all.
> >>
> >> I took this approach because to me propagating errors sounds more logical
> >> than ignoring them and I was arguing in the commit message that the
> >> infrastructure to enable that could be put in place as opt-in.
> >>
> >> I also do not see a lot of magic in this patch. Only thing, potentially the
> >> logic should be inverted so that the waiter marks itself as interested in
> >> receiving errors. That would probably make even more sense as a core
> >> concept.
> >>
> >> Has there been a wider discussion on this topic in the past? I am curious to
> >> know, even if propagation currently is i915 only, could other drivers be
> >> interested.
> >
> > There hasn't been. i915-gem team decided "this is a cool concept", which
> > resulted in a security bug. Now we're a few months in arguing whether a
> > cool-looking concept that leads to a security bug is maybe a good idea,
> > and whether we should sneak it in as a core concept to dma-buf.h without
> > any wider discussion on the concept.
> >
> >> Note that it adds almost nothing to the dma-buf common code about a single
> >> flag, and at some point (currently missing) documentation on the very flag.
> >
> > This is really not how upstream collaboration works, and it needs to stop.
> >
> > If you want this, start another thread arguing why this is a good idea,
> > fully decoupled from the security fix here.
>
> When I asked you whether you know there were past discussions on this
> topic, clearly the point of that was to figure out whether a new
> discussion needs to be started, or I need to go and read an existing one
> to get up to speed.
>
> I don't know how you interpreted that as an attempt to sneak anything
> in. And I don't know how I could have reliably figured out the answer to
> that question without asking. So colour me confused.
>
> To clarify on the security issue part - are you talking about
> https://gitlab.freedesktop.org/drm/intel/-/issues/3080, or the other
> security issue, the one which would be caused by simply reverting the
> error propagation in i915?

Both.

But what I really don't get is why we keep defending a "it's cool"
type of concept that accidentally caused a security bug as a
consequence that neither author nor reviewer managed to forsee.

If it accidentally causes a security bug and is otherwise not
justified by anything, it's not cool. There isn't anything left to
argue about here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch