On Thu, Jul 31, 2014 at 04:37:14PM +0000, Mcaulay, Alistair wrote:
> Hi Daniel,
>
> Something more like this then? (and revert the change to
> intel_ring_begin(), putting it back to how it was)

Yeah, roughly. Except that I would place the reload_in_reset wrapping in
the i915_reset function. It is paramount that we never leak this outside
of the dev->struct_mutex protection, so that other threads can't ever
observe this to be set. So putting it right next to the mutex locking is
better.

Also, I think you've wrapped the wrong function - the re-init is done in
i915_gem_init_hw. The function wrapped here (i915_gem_reset) just resets
the software state (mostly) and runs before the actual gpu hw reset is
done. gem_init_hw is only run if the reset succeeds.
-Daniel

>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 991b663..b811ff2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1217,6 +1217,9 @@ struct i915_gpu_error {
>
>  	/* For missed irq/seqno simulation. */
>  	unsigned int test_irq_rings;
> +
> +	/* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
> +	bool reload_in_reset;
>  };
>
>  enum modeset_restore {
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index b38e086..a25d3b5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1085,7 +1085,9 @@ i915_gem_check_wedge(struct i915_gpu_error *error,
>  		if (i915_terminally_wedged(error))
>  			return -EIO;
>
> -		return -EAGAIN;
> +		/* Check if GPU Reset is in progress */
> +		if (!error->reload_in_reset)
> +			return -EAGAIN;
>  	}
>
>  	return 0;
> @@ -2579,6 +2581,8 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct intel_engine_cs *ring;
>  	int i;
>
> +	/* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
> +	dev_priv->gpu_error.reload_in_reset = true;
>  	/*
>  	 * Before we free the objects from the requests, we need to inspect
>  	 * them for finding the guilty party. As the requests only borrow
> @@ -2591,6 +2595,8 @@ void i915_gem_reset(struct drm_device *dev)
>  		i915_gem_reset_ring_cleanup(dev_priv, ring);
>
>  	i915_gem_restore_fences(dev);
> +
> +	dev_priv->gpu_error.reload_in_reset = false;
>  }
>
>
> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@xxxxxxxx] On Behalf Of Daniel Vetter
> Sent: Wednesday, July 30, 2014 10:01 PM
> To: Mcaulay, Alistair
> Cc: Daniel Vetter; Chris Wilson; Ben Widawsky; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/i915: Rework GPU reset sequence to match driver load & thaw
>
> On Wed, Jul 30, 2014 at 04:59:33PM +0000, Mcaulay, Alistair wrote:
> > Hi Daniel,
> >
> > Could you please be clearer about the change you mean? I think you
> > mean something functionally equivalent to the code below, but done
> > in a less hacky way. (This slight change has made no difference to
> > the test results.) Or is the idea to return at a different point
> > to this?
> > I couldn't find "dev_priv->mm.reload_in_reset or similar" in the
> > code. The only thing I can find is error->reset_counter, which is used
> > in check_wedge(). Bottom bit set means RESET_IN_PROGRESS, top bit
> > means WEDGED.
>
> Well, I meant that you have to add a new dev_priv->mm.reload_in_reset.
> And the below won't work, since in all other places but when doing a gpu
> reset we want the -EAGAIN to reach callers. Actually it's really
> important that if we have an -EAGAIN we don't eat it.
>
> And I guess the check for mm.reload_in_reset should actually be in
> gem_check_wedged.
> -Daniel
>
> >
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -1832,7 +1832,9 @@ int intel_ring_begin(struct intel_engine_cs *ring,
> >
> >  	ret = i915_gem_check_wedge(&dev_priv->gpu_error,
> >  				   dev_priv->mm.interruptible);
> > -	if (ret)
> > +
> > +	/* -EAGAIN means a reset is in progress, it is Ok to return */
> > +	if (ret == -EAGAIN)
> > +		return 0;
> > +	if (ret)
> > +		return ret;
> >
> >  	ret = __intel_ring_prepare(ring, num_dwords * sizeof(uint32_t));
> >
> > Alistair.
> >
> > -----Original Message-----
> > From: Intel-gfx [mailto:intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf Of Daniel Vetter
> > Sent: Tuesday, July 29, 2014 11:33 AM
> > To: Chris Wilson; Daniel Vetter; Ben Widawsky; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] drm/i915: Rework GPU reset sequence to match driver load & thaw
> >
> > On Tue, Jul 29, 2014 at 08:36:33AM +0100, Chris Wilson wrote:
> > > On Mon, Jul 28, 2014 at 11:26:38AM +0200, Daniel Vetter wrote:
> > > > Oh, I guess that's the tricky bit why the old approach never
> > > > worked - because reset_in_progress is set we failed the
> > > > context/ppgtt loading through the rings and screwed up.
> > > >
> > > > Problem with your approach is that we want to bail out here if a
> > > > reset is in progress, so we can't just eat the EAGAIN. If we do
> > > > that we potentially deadlock or overflow the ring.
> > > >
> > > > I think we need a different hack here, and a few layers down (i.e.
> > > > at the place where we actually generate that offending -EAGAIN).
> > > >
> > > > - Around the re-init sequence in the reset function we set
> > > >   dev_priv->mm.reload_in_reset or similar. Since we hold
> > > >   dev->struct_mutex no one will see that, as long as we never
> > > >   leak it out of the critical section.
> > > >
> > > > - In the ring_begin code that checks for gpu hangs we ignore
> > > >   reset_in_progress if this bit is set.
> > > >
> > > > - Both places need fairly big comments to explain what exactly
> > > >   is going on.
> > >
> > > This is going from bad to worse. I think you can do better if you
> > > looked at the problem afresh.
> >
> > Well, we can't really clear reset_in_progress at that point, since not
> > all of the reset is done yet. Especially the modeset stuff. So I don't
> > think that reordering the reset sequence would get us out of this ugly
> > spot. And I don't see any other solution really. Do you?
> > -Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
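
For readers following the thread, the scheme under discussion (a flag that
is set only inside the locked reset path and consulted at the point where
-EAGAIN is generated) can be sketched as a small user-space model. This is
not the driver code and not the patch that was eventually merged. Every
name starting with model_ is hypothetical, the lock_held field is only a
stand-in for dev->struct_mutex rather than a real mutex, and the
interruptible handling of the real i915_gem_check_wedge is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct i915_gpu_error. */
struct model_gpu_error {
	bool reset_in_progress;	/* models the RESET_IN_PROGRESS bit */
	bool terminally_wedged;	/* models the WEDGED bit */
	bool reload_in_reset;	/* true only inside the locked reset path */
	bool lock_held;		/* stand-in for dev->struct_mutex, not a mutex */
};

/* Models the wedge check: 0 if work may proceed, else a negative errno. */
static int model_check_wedge(const struct model_gpu_error *e)
{
	if (e->reset_in_progress) {
		if (e->terminally_wedged)
			return -EIO;
		/* Ordinary callers must still see -EAGAIN and back off. */
		if (!e->reload_in_reset)
			return -EAGAIN;
	}
	return 0;
}

/*
 * Models the hardware re-init step, which submits through the rings and
 * therefore passes through the wedge check.
 */
static int model_init_hw(const struct model_gpu_error *e)
{
	return model_check_wedge(e);
}

/*
 * Models the reset path. hw_reset_ret is the (assumed) result of the
 * hardware reset; re-init only runs when it is 0. The flag is set and
 * cleared strictly between taking and dropping the lock stand-in.
 */
static int model_reset(struct model_gpu_error *e, int hw_reset_ret)
{
	int ret = hw_reset_ret;

	assert(!e->lock_held);
	e->lock_held = true;		/* would be mutex_lock() */

	if (hw_reset_ret == 0) {
		e->reload_in_reset = true;
		ret = model_init_hw(e);
		e->reload_in_reset = false;
	}

	e->lock_held = false;		/* would be mutex_unlock() */
	return ret;
}
```

In this model a caller outside the reset path still gets -EAGAIN while
reset_in_progress is set, whereas the re-init step called from
model_reset() gets 0. Keeping the set and clear next to the lock and
unlock makes it easy to see that the flag cannot be observed outside the
critical section, which is the property the thread insists on.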