Re: [PATCH] drm/i915: Reboot CI if forcewake fails

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> If the HW fail to ack a change in forcewake status, the machine is as
> good as dead -- it may recover, but in reality it missed the mmio
> updates and is now in a very inconsistent state. If it happens, we can't
> trust the CI results (or at least the fails may be genuine but due to
> the HW being dead and not the actual test!) so reboot the machine (CI
> checks for a kernel taint in between each test and reboots if the
> machine is tainted).
>
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>

Sounds and looks reasonable. Should we also taint if we have
unclaimed mmio after init sequence?

Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c |  2 +-
>  drivers/gpu/drm/i915/i915_drv.h       | 11 +++++++++++
>  drivers/gpu/drm/i915/intel_uncore.c   |  8 ++++++--
>  3 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 419b3415370b..464369bc55ad 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1042,7 +1042,7 @@ void i915_reset(struct drm_i915_private *i915,
>  	 * rather than continue on into oblivion. For everyone else,
>  	 * the system should still plod along, but they have been warned!
>  	 */
> -	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> +	add_taint_for_CI(TAINT_WARN);
>  error:
>  	__i915_gem_set_wedged(i915);
>  	goto finish;
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 0a6ec61496f1..d0257808734c 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3375,4 +3375,15 @@ static inline u32 i915_scratch_offset(const struct drm_i915_private *i915)
>  	return i915_ggtt_offset(i915->gt.scratch);
>  }
>  
> +static inline void add_taint_for_CI(unsigned int taint)
> +{
> +	/*
> +	 * The system is "ok", just about surviving for the user, but
> +	 * CI results are now unreliable as the HW is very suspect.
> +	 * CI checks the taint state after every test and will reboot
> +	 * the machine if the kernel is tainted.
> +	 */
> +	add_taint(taint, LOCKDEP_STILL_OK);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index d1d51e1121e2..6ec1bc97b665 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -111,9 +111,11 @@ wait_ack_set(const struct intel_uncore_forcewake_domain *d,
>  static inline void
>  fw_domain_wait_ack_clear(const struct intel_uncore_forcewake_domain *d)
>  {
> -	if (wait_ack_clear(d, FORCEWAKE_KERNEL))
> +	if (wait_ack_clear(d, FORCEWAKE_KERNEL)) {
>  		DRM_ERROR("%s: timed out waiting for forcewake ack to clear.\n",
>  			  intel_uncore_forcewake_domain_to_str(d->id));
> +		add_taint_for_CI(TAINT_WARN); /* CI unreliable */
> +	}
>  }
>  
>  enum ack_type {
> @@ -186,9 +188,11 @@ fw_domain_get(const struct intel_uncore_forcewake_domain *d)
>  static inline void
>  fw_domain_wait_ack_set(const struct intel_uncore_forcewake_domain *d)
>  {
> -	if (wait_ack_set(d, FORCEWAKE_KERNEL))
> +	if (wait_ack_set(d, FORCEWAKE_KERNEL)) {
>  		DRM_ERROR("%s: timed out waiting for forcewake ack request.\n",
>  			  intel_uncore_forcewake_domain_to_str(d->id));
> +		add_taint_for_CI(TAINT_WARN); /* CI unreliable */
> +	}
>  }
>  
>  static inline void
> -- 
> 2.20.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux