Re: [PATCH] drm/i915/execlists: Clear STOP_RING bit before restoring the context

Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> · Tue, 14 Aug 2018 16:42:41 +0300

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> Before a reset, we set the STOP_RING bit of RING_MI_MODE to freeze the
> engine. However, sometimes we observe that upon restart, the engine
> freezes again with the STOP_RING bit still asserted. By inspection, we
> know that the register itself is cleared by the GPU reset, so that bit
> must be preserved inside the context image and reloaded from there. A
> simple fix (as the RING_MI_MODE is at a fixed offset in a valid context)
> is to clobber the STOP_RING bit inside the image before replaying the
> request.
>
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Cc: Michel Thierry <michel.thierry@xxxxxxxxx>
> Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c     | 17 +++++++++++++++--
>  drivers/gpu/drm/i915/intel_lrc_reg.h |  2 ++
>  2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3f90c74038ef..37fe842de639 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1918,6 +1918,20 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  
>  	spin_unlock_irqrestore(&engine->timeline.lock, flags);
>  
> +	if (!request)
> +		return;
> +
> +	regs = request->hw_context->lrc_reg_state;
> +
> +	/*
> +	 * After a reset, the context may have preserved the STOP bit
> +	 * of RING_MI_MODE we used to freeze the active engine before
> +	 * the reset. If that bit is restored the ring stops instead
> +	 * of being executed.
> +	 */
> +	regs[CTX_MI_MODE + 1] |= STOP_RING << 16;
> +	regs[CTX_MI_MODE + 1] &= ~STOP_RING;

Ok, I did go and pondered if this is truely the simplest
way. Forcing the start outside a context would not
be simpler and would be slower.

Nice find and it will has potential to explain our some troubles
of re-enabling engines.

Did you find out this by looking at the error states or what?

Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

> +
>  	/*
>  	 * If the request was innocent, we leave the request in the ELSP
>  	 * and will try to replay it on restarting. The context image may
> @@ -1929,7 +1943,7 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  	 * and have to at least restore the RING register in the context
>  	 * image back to the expected values to skip over the guilty request.
>  	 */
> -	if (!request || request->fence.error != -EIO)
> +	if (request->fence.error != -EIO)
>  		return;
>  
>  	/*
> @@ -1940,7 +1954,6 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  	 * future request will be after userspace has had the opportunity
>  	 * to recreate its own state.
>  	 */
> -	regs = request->hw_context->lrc_reg_state;
>  	if (engine->pinned_default_state) {
>  		memcpy(regs, /* skip restoring the vanilla PPHWSP */
>  		       engine->pinned_default_state + LRC_STATE_PN * PAGE_SIZE,
> diff --git a/drivers/gpu/drm/i915/intel_lrc_reg.h b/drivers/gpu/drm/i915/intel_lrc_reg.h
> index 5ef932d810a7..3b155ecbfa92 100644
> --- a/drivers/gpu/drm/i915/intel_lrc_reg.h
> +++ b/drivers/gpu/drm/i915/intel_lrc_reg.h
> @@ -39,6 +39,8 @@
>  #define CTX_R_PWR_CLK_STATE		0x42
>  #define CTX_END				0x44
>  
> +#define CTX_MI_MODE			0x54
> +
>  #define CTX_REG(reg_state, pos, reg, val) do { \
>  	u32 *reg_state__ = (reg_state); \
>  	const u32 pos__ = (pos); \
> -- 
> 2.18.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx