Re: [PATCH] drm/i915: Rework GPU reset sequence to match driver load & thaw

"Mcaulay, Alistair" <alistair.mcaulay@xxxxxxxxx> · Wed, 30 Jul 2014 16:59:33 +0000

Hi Daniel,

Could you please clarify which change you mean?  I think you mean something functionally equivalent to the diff below, but done in a less hacky way.
(This small change made no difference to the test results.)
Or is the idea to return at a different point?
I couldn't find "dev_priv->mm.reload_in_reset or similar" in the code. The only thing I can find is error->reset_counter,
which is used in check_wedge(). Bottom bit set means RESET_IN_PROGRESS, top bit set means WEDGED.
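
For reference, here is my reading of that encoding and of how the wedge check turns it into an error code, written as a standalone model rather than copied from the driver. The macro and function names are shortened stand-ins, and the non-interruptible -EIO case is my understanding of the current behaviour, so treat it as an assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the driver's flag bits (names are assumptions). */
#define RESET_IN_PROGRESS_FLAG	1u		/* bottom bit of reset_counter */
#define WEDGED_FLAG		(1u << 31)	/* top bit of reset_counter */

/* A reset is pending, or the GPU is wedged for good. */
static bool reset_in_progress(uint32_t reset_counter)
{
	return reset_counter & (RESET_IN_PROGRESS_FLAG | WEDGED_FLAG);
}

/* The reset failed and the GPU is considered dead. */
static bool terminally_wedged(uint32_t reset_counter)
{
	return reset_counter & WEDGED_FLAG;
}

/* Model of the check: 0 = carry on, -EAGAIN = retry later, -EIO = give up. */
static int check_wedge(uint32_t reset_counter, bool interruptible)
{
	if (!reset_in_progress(reset_counter))
		return 0;
	if (!interruptible)		/* caller cannot handle -EAGAIN */
		return -EIO;
	if (terminally_wedged(reset_counter))
		return -EIO;
	return -EAGAIN;
}
```

So in this model -EAGAIN only comes back for an interruptible caller while a reset is pending and the GPU is not wedged, which is the case the diff below special-cases.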

 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -1832,7 +1832,9 @@ int intel_ring_begin(struct intel_engine_cs *ring,

  	ret = i915_gem_check_wedge(&dev_priv->gpu_error,
  				   dev_priv->mm.interruptible);
 -	if (ret)
 +
 +	/* -EAGAIN means a reset is in progress, it is Ok to return */
 +	if (ret == -EAGAIN)
 +		return 0;
 +	if (ret)
 +		return ret;

  	ret = __intel_ring_prepare(ring, num_dwords * sizeof(uint32_t));
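
For comparison, if I have understood the suggestion quoted below, the flag-based version would look roughly like the following. This is only a sketch to check my understanding: reload_in_reset does not exist in the tree, and the struct, the helper names and load_context() are all hypothetical stand-ins for the real driver state and re-init path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the relevant driver state. */
struct gpu_state {
	bool reset_in_progress;	/* set while the reset handler runs */
	bool reload_in_reset;	/* set only around re-init, under the mutex */
};

/* The hang check that ring_begin would perform. */
static int ring_begin_check(const struct gpu_state *gpu)
{
	/*
	 * Normally bail out while a reset is pending. The one exception
	 * is the re-init sequence inside the reset handler itself, which
	 * has to emit commands through the rings.
	 */
	if (gpu->reset_in_progress && !gpu->reload_in_reset)
		return -EAGAIN;
	return 0;
}

/* Stand-in for context/ppgtt loading, which goes through the rings. */
static int load_context(struct gpu_state *gpu)
{
	return ring_begin_check(gpu);
}

/* Re-init step of the reset handler; the caller is assumed to hold the mutex. */
static int reinit_hw(struct gpu_state *gpu, int (*load)(struct gpu_state *))
{
	int ret;

	gpu->reload_in_reset = true;
	ret = load(gpu);
	/* Clear before leaving the critical section so nobody else sees it. */
	gpu->reload_in_reset = false;
	return ret;
}
```

Unlike the diff above, every other caller of the check still gets -EAGAIN while the reset is pending; only the re-init path is let through.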

Alistair.

-----Original Message-----
From: Intel-gfx [mailto:intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf Of Daniel Vetter
Sent: Tuesday, July 29, 2014 11:33 AM
To: Chris Wilson; Daniel Vetter; Ben Widawsky; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re:  [PATCH] drm/i915: Rework GPU reset sequence to match driver load & thaw

On Tue, Jul 29, 2014 at 08:36:33AM +0100, Chris Wilson wrote:
> On Mon, Jul 28, 2014 at 11:26:38AM +0200, Daniel Vetter wrote:
> > Oh, I guess that's the tricky bit why the old approach never worked 
> > - because reset_in_progress is set we failed the context/ppgtt 
> > loading through the rings and screwed up.
> > 
> > Problem with your approach is that we want to bail out here if a 
> > reset is in progress, so we can't just eat the EAGAIN. If we do that 
> > we potentially deadlock or overflow the ring.
> > 
> > I think we need a different hack here, and a few layers down (i.e. 
> > at the place where we actually generate that offending -EAGAIN).
> > 
> > - Around the re-init sequence in the reset function we set
> >   dev_priv->mm.reload_in_reset or similar. Since we hold dev->struct_mutex
> >   no one will see that, as long as we never leak it out of the critical
> >   section.
> > 
> > - In the ring_begin code that checks for gpu hangs we ignore
> >   reset_in_progress if this bit is set.
> > 
> > - Both places need fairly big comments to explain what exactly is going
> >   on.
> 
> This is going from bad to worse. I think you can do better if you 
> looked at the problem afresh.

Well, we can't really clear reset_in_progress at that point, since the reset isn't fully done yet, especially the modeset stuff. So I don't think that reordering the reset sequence would get us out of this ugly spot, and I don't really see any other solution. Do you?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx