Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore

Daniel Vetter <daniel@xxxxxxxx> · Mon, 3 Jul 2017 09:55:48 +0200

On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjälä wrote:
> On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
> > <ville.syrjala@xxxxxxxxxxxxxxx> wrote:
> > >> And if the GEM folks insist the old behavior can't be restored, then we
> > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > >> depend upon is imo the simplest solution to this, and we've had a driver
> > >> that worked like that for years.
> > >
> > > And it used to break all the time. I think we've had to fix it at least
> > > three times by now. So I tend to think it's better to fix it in a way
> > > that won't break so easily.
> > 
> > Why exactly is making the atomic code massively more tricky the easy
> > fix?
> 
> I don't see what this massive trickiness is. Compared to the rest of
> atomic what I have is absolutely trivial. Just the
> duplicate_committed_state() and the '.committed_state = foo'
> assignments in hw_done(). That's it really.

From a quick look and your description it seems full of races. I'm not
sure it'll still be simple once those are fixed.

> > That's the part I don't get. Yes it got broken a bunch because no
> > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > reset, but in the end we do have a dependency loop, and either the
> > modeset side or the render side needs to bail out and cancel its
> > async stuff (whether that's a request or a nonblocking flip/atomic
> > commit doesn't matter). In my opinion, cancelling the request (even if
> > we're clever and only cancel the requests for the modeset waiters,
> > which isn't too hard to pull off) seems about the simplest option.
> > Especially since we need that code anyway, even TDR can't save
> > everything and resubmit under all circumstances (at least the buggy
> > batch can't be resubmitted).
> > 
> > Cancelling any kind of atomic commit otoh looks like a lot more
> > complexity.
> 
> I'm not cancelling anything.

Well by overtaking the in-flight commit you are at least fighting with
that. Either you need to cancel that one, or insert the gpu reset commit
at the right point (and with the right state). Current code drops that and
instead seems to just hope it doesn't lead to tears too much.

> > Why do you think this is the easier, or at least less
> > fragile option? This patch series is full of FIXMEs, and even the more
> > complete set seems to have a pile of holes. Plus we need to stop using
> > obj->state, and I don't see any easy way to test for that (since the
> > gen3/4 gpu reset case is the only corner case that currently needs
> > that).
> 
> We need to fix that stuff anyway if we ever want to queue up multiple
> commits for the same crtc. The stuff I have that is specific to this
> reset stuff is actually very simple. The rest is just fixing up the
> huge mess we've already made.

Rewriting the world for a regression fix seems a bit much is all I'm
saying. And I'm not sure your approach works without that "rewrite the
world" step. De facto, what your current patches seem to result in is:
- we commit the final sw state in gpu reset
- before we resubmit the rendering

That's much easier to pull off by simply force-completing all
i915_sw_fences before we take any modeset locks in the gpu reset path.
Note that we don't need to force-complete any i915_gem_request, we can
just force-complete the i915_sw_fences the work item is blocked on. Needs
some care to avoid races with a new atomic commit (since we need to
force-complete before we grab locks one might sneak in), but that's a
standard pattern.
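
To illustrate the ordering I mean, here is a rough single-threaded
userspace model, not driver code. Every name in it (model_fence,
model_commit, model_dev, queue_commit, reset_prepare, reset_finish) is
made up for this sketch and none of it is existing i915 or
i915_sw_fence API. It assumes the race is closed by raising a
"reset in progress" flag first, so a commit that sneaks in after the
sweep gets its fence force-completed at queue time. Real code would
need proper locking or barriers around the flag and the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the i915 structures. */
struct model_fence {
	bool signaled;
};

struct model_commit {
	struct model_fence *wait_on;	/* fence the commit worker blocks on */
};

#define MODEL_MAX_COMMITS 8

struct model_dev {
	bool reset_in_progress;
	bool modeset_locked;
	struct model_commit *pending[MODEL_MAX_COMMITS];
	size_t n_pending;
};

static void force_complete(struct model_fence *fence)
{
	fence->signaled = true;
}

/* Queue a nonblocking commit; returns false if the model's list is full. */
static bool queue_commit(struct model_dev *dev, struct model_commit *commit)
{
	if (dev->n_pending == MODEL_MAX_COMMITS)
		return false;

	/* A commit that sneaks in during reset must not block on rendering. */
	if (dev->reset_in_progress)
		force_complete(commit->wait_on);

	dev->pending[dev->n_pending++] = commit;
	return true;
}

/* Reset path: flag first, then sweep, and only then take the locks. */
static void reset_prepare(struct model_dev *dev)
{
	size_t i;

	assert(!dev->modeset_locked);

	dev->reset_in_progress = true;		/* 1. close the window */
	for (i = 0; i < dev->n_pending; i++)	/* 2. unblock existing commits */
		force_complete(dev->pending[i]->wait_on);
	dev->modeset_locked = true;		/* 3. now grabbing locks is safe */
}

static void reset_finish(struct model_dev *dev)
{
	dev->modeset_locked = false;
	dev->reset_in_progress = false;
}
```

The point of the ordering is that only the fences the commit work items
wait on get completed; the requests themselves are left alone.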

Plus we then need to wait for all outstanding nonblocking commits once we
do have all modeset locks, since with atomic holding the locks only syncs
against synchronous hw commits (i.e. do a fake synchronous commit before we
nuke the display and we're good). A variant of wait_for_dependencies that
waits for all crtcs instead of just the crtcs in an atomic commit would do,
I think.
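
Roughly, the difference between the two waits looks like the toy model
below. Again this is a userspace sketch with invented names
(model_crtc, model_wait_for_dependencies, model_wait_for_all_crtcs),
not the atomic helper code, and the "wait" is faked by draining a
counter where real code would block on the commit's completion.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_CRTC 4

/* Hypothetical per-crtc bookkeeping; not struct drm_crtc. */
struct model_crtc {
	unsigned int outstanding;	/* nonblocking commits not yet hw_done */
};

/* Stand-in for blocking until the crtc's commits have reached hw_done. */
static void model_wait_crtc_idle(struct model_crtc *crtc)
{
	crtc->outstanding = 0;
}

/* What an ordinary commit does: wait only on the crtcs it touches. */
static void model_wait_for_dependencies(struct model_crtc *crtcs,
					unsigned int crtc_mask)
{
	unsigned int i;

	for (i = 0; i < MODEL_MAX_CRTC; i++)
		if (crtc_mask & (1u << i))
			model_wait_crtc_idle(&crtcs[i]);
}

/* Reset variant: same wait, but over every crtc on the device. */
static void model_wait_for_all_crtcs(struct model_crtc *crtcs)
{
	model_wait_for_dependencies(crtcs, (1u << MODEL_MAX_CRTC) - 1);
}

/* True once no crtc has a nonblocking commit in flight. */
static bool model_display_idle(const struct model_crtc *crtcs)
{
	unsigned int i;

	for (i = 0; i < MODEL_MAX_CRTC; i++)
		if (crtcs[i].outstanding)
			return false;
	return true;
}
```

In the reset path the all-crtc variant would run after all modeset locks
are held and before the display gets nuked.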

None of this requires any of the prep work we need for the fancy additional
atomic stuff you plan to do. Which I think is really good for a regression
fix.

So again, why do we need to rewrite the world (since these patches here
seem to just be the racy poc) to fix reset on gen3/4?

I know you want to do all this, but tangling up a regression fix in a
rewrite isn't a good idea in my opinion. I'm not against your long-term
plans, I just think it'd be good if we can have them as orthogonal pieces
if feasible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch