On Thu, 3 Jul 2014 08:09:01 +0100 Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> Since we rely on hangcheck to wake up and kick us out of an indefinite
> wait should the GPU ever stop functioning, it appears sensible that we
> should check that hangcheck is indeed active before starting that wait.
> This just prevents a driver error in the processing of hangcheck from
> appearing to hang the machine.
> 
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>  drivers/gpu/drm/i915/i915_gem.c         |  2 ++
>  drivers/gpu/drm/i915/i915_irq.c         | 11 +++++++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.c |  2 ++
>  4 files changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 8670d05..c326a99 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2149,6 +2149,7 @@ extern void intel_console_resume(struct work_struct *work);
> 
>  /* i915_irq.c */
>  void i915_queue_hangcheck(struct drm_device *dev);
> +void i915_check_hangcheck(struct drm_device *dev);
>  __printf(3, 4)
>  void i915_handle_error(struct drm_device *dev, bool wedged,
>  		       const char *fmt, ...);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index ecb835b..120ed6d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1455,6 +1455,8 @@ static int __wait_seqno(struct intel_engine_cs *ring, u32 seqno,
>  			break;
>  		}
> 
> +		i915_check_hangcheck(ring->dev);
> +
>  		timer.function = NULL;
>  		if (timeout || missed_irq(dev_priv, ring)) {
>  			unsigned long expire;
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 9e5a295..daa590e 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -3063,6 +3063,17 @@ void i915_queue_hangcheck(struct drm_device *dev)
>  		  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
>  }
> 
> +void i915_check_hangcheck(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	if (!i915.enable_hangcheck)
> +		return;
> +
> +	if (!timer_pending(&dev_priv->gpu_error.hangcheck_timer))
> +		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
> +			  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
> +}
> +
>  static void ibx_irq_reset(struct drm_device *dev)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 4f3397f..6365d4d 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1634,6 +1634,8 @@ static int ring_wait_for_space(struct intel_engine_cs *ring, int n)
>  			master_priv->sarea_priv->perf_boxes |= I915_BOX_WAIT;
>  		}
> 
> +		i915_check_hangcheck(dev);
> +
>  		io_schedule_timeout(1);
> 
>  		if (dev_priv->mm.interruptible && signal_pending(current)) {

Are there any bugs associated with this?

i915_rearm_hangcheck() or something might more accurately describe
what's going on here.

I suppose both of these paths are protected by the struct_mutex?  If
not, might we race and mod_timer() this twice from two threads in
succession?  I guess that's harmless...

-- 
Jesse Barnes, Intel Open Source Technology Center
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx