On Fri, Jun 13, 2014 at 04:37:59PM +0100, oscar.mateo@xxxxxxxxx wrote:
> From: Oscar Mateo <oscar.mateo@xxxxxxxxx>
>
> In the current Execlists feeding mechanism, full preemption is not
> supported yet: only lite-restores are allowed (that is: the GPU
> simply samples a new tail pointer for the context currently in
> execution).
>
> But we have identified a scenario in which a full preemption occurs:
> 1) We submit two contexts for execution (A & B).
> 2) The GPU finishes with the first one (A), switches to the second one
> (B) and informs us.
> 3) We submit B again (hoping to cause a lite restore) together with C,
> but in the time we spend writing to the ELSP, the GPU finishes B.
> 4) The GPU starts executing B again (since we told it so).
> 5) We receive a B finished interrupt and, mistakenly, we submit C (again)
> and D, causing a full preemption of B.
>
> By keeping better track of our submissions, we can avoid the scenario
> described above.

How? I don't see a way to fundamentally avoid the above race, and I
don't really see an issue with it - the GPU should notice that there
isn't really any new work and then switch to C. Or am I completely
missing the point here?

With no clue at all this looks really scary.

> v2: elsp_submitted belongs in the new intel_ctx_submit_request. Several
> rebase changes.
>
> Signed-off-by: Oscar Mateo <oscar.mateo@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 28 ++++++++++++++++++++++++----
>  drivers/gpu/drm/i915/intel_lrc.h |  2 ++
>  2 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 290391c..f388b28 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -248,6 +248,7 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>  		else if (req0->ctx == cursor->ctx) {
>  			/* Same ctx: ignore first request, as second request
>  			 * will update tail past first request's workload */
> +			cursor->elsp_submitted = req0->elsp_submitted;
>  			list_del(&req0->execlist_link);
>  			queue_work(dev_priv->wq, &req0->work);
>  			req0 = cursor;
> @@ -257,8 +258,14 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>  		}
>  	}
>
> +	WARN_ON(req1 && req1->elsp_submitted);
> +
>  	BUG_ON(execlists_submit_context(ring, req0->ctx, req0->tail,
>  			req1? req1->ctx : NULL, req1? req1->tail : 0));

Aside: No BUG_ON except when you can prove that the kernel will die
within the current function anyway. I've seen too many cases where
people sprinkle BUG_ON instead of WARN_ON for not-completely-lethal
issues with the argument that stopping the box helps debugging. That's
kinda true for initial development, but not true when shipping: The
usual result is a frustrated user/customer looking at a completely
frozen box (because someone managed to hit the BUG_ON within a spinlock
that the irq handler requires, and then the machine is gone) and an
equally frustrated developer half a world away. A dying kernel that
spews useful crap into the logs with its last breath is _much_ better,
even when you know that there's no way we can ever recover from a given
situation.
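To make that concrete, an untested sketch of what I'd rather see here -
warn and bail instead of stopping the machine (this assumes
execlists_submit_context() keeps returning non-zero on failure, which is
what the BUG_ON above already implies):

	/*
	 * Sketch only: shout into the logs and drop this submission on
	 * the floor instead of killing the box. WARN_ON() evaluates to
	 * the truth value of its condition, so this bails out on failure.
	 */
	if (WARN_ON(execlists_submit_context(ring, req0->ctx, req0->tail,
					     req1 ? req1->ctx : NULL,
					     req1 ? req1->tail : 0)))
		return;

Whether silently dropping the work is the right recovery is a separate
discussion, but at least the machine stays alive and the logs tell us
what went wrong.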
</rant>

Cheers, Daniel

> +
> +	req0->elsp_submitted++;
> +	if (req1)
> +		req1->elsp_submitted++;
>  }
>
>  static bool execlists_check_remove_request(struct intel_engine_cs *ring,
> @@ -275,9 +282,13 @@ static bool execlists_check_remove_request(struct intel_engine_cs *ring,
>  		struct drm_i915_gem_object *ctx_obj =
>  				head_req->ctx->engine[ring->id].obj;
>  		if (intel_execlists_ctx_id(ctx_obj) == request_id) {
> -			list_del(&head_req->execlist_link);
> -			queue_work(dev_priv->wq, &head_req->work);
> -			return true;
> +			WARN(head_req->elsp_submitted == 0,
> +			     "Never submitted head request\n");
> +			if (--head_req->elsp_submitted <= 0) {
> +				list_del(&head_req->execlist_link);
> +				queue_work(dev_priv->wq, &head_req->work);
> +				return true;
> +			}
>  		}
>  	}
>
> @@ -310,7 +321,16 @@ void intel_execlists_handle_ctx_events(struct intel_engine_cs *ring)
>  		status_id = I915_READ(RING_CONTEXT_STATUS_BUF(ring) +
>  				(read_pointer % 6) * 8 + 4);
>
> -		if (status & GEN8_CTX_STATUS_COMPLETE) {
> +		if (status & GEN8_CTX_STATUS_PREEMPTED) {
> +			if (status & GEN8_CTX_STATUS_LITE_RESTORE) {
> +				if (execlists_check_remove_request(ring, status_id))
> +					WARN(1, "Lite Restored request removed from queue\n");
> +			} else
> +				WARN(1, "Preemption without Lite Restore\n");
> +		}
> +
> +		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) ||
> +		    (status & GEN8_CTX_STATUS_ELEMENT_SWITCH)) {
>  			if (execlists_check_remove_request(ring, status_id))
>  				submit_contexts++;
>  		}
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 7949dff..ee877aa 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -51,6 +51,8 @@ struct intel_ctx_submit_request {
>
>  	struct list_head execlist_link;
>  	struct work_struct work;
> +
> +	int elsp_submitted;
>  };
>
>  void intel_execlists_handle_ctx_events(struct intel_engine_cs *ring);
> --
> 1.9.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx