From: Tim Gore <tim.gore@xxxxxxxxx> Simulated hangs, as used by drv_hangman and some other IGT tests, are not handled correctly with the new per-engine hang recovery mode. This patch fixes several issues needed to get them working in the execlist case. 1) The "simulated" hang is effected by not submitting a particular batch buffer to the hardware. In this way it is not handled by the hardware are hence remains in the software queue, leading the TDR mechanism to declare a hang. The place where the submission of the batch was being blocked was in intel_logical_ring_advance_and_submit. Because this means the request never enters the execlist_queue, the TDR mechanism does not detect a hang and the situation is never cleared. Also, blocking the batch buffer here is before the intel_ctx_submit_request object gets allocated. During TDR we need to actually complete the submission process to unhang the ring, but we are not in user context so cannot allocate the request object. To overcome both these issues I moved the place where submission is blocked to execlists_context_unqueue. This means that the request enters the ring->execlist_queue, so the TDR mechanism detects the hang and can resubmit the request after the stop_rings bit is cleared. 2) A further problem arises from a workaround in i915_hangcheck_sample to deal with a context submission status of "...INCONSISTENT" being reported by intel_execlists_TDR_get_current_request() when the hardware is idle. A simulated hang, because it causes the sw and hw context id's to be out of sync, results in a context submission status of "...INCONSISTENT" being reported, triggering this workaround which resubmits the batch and clears the hang, avoiding a ring reset. But we want the ring reset to occur, since this is part of what we are testing. So, I have made intel_execlists_TDR_get_current_request() aware of simulated hangs, so that it returns a status of OK in this case. This avoids the workaround being triggered, leading to the TDR mechanism declaring a ring hang and doing a ring reset. Issue: VIZ-5488 Signed-off-by: Tim Gore <tim.gore@xxxxxxxxx> Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx> --- drivers/gpu/drm/i915/i915_drv.c | 10 ++++++++++ drivers/gpu/drm/i915/intel_lrc.c | 26 +++++++++++++++++++++----- 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index d482341..ecb3eb1 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1175,6 +1175,16 @@ int i915_reset_engine(struct intel_engine_cs *engine) } } + /* Clear any simulated hang flags */ + if (dev_priv->gpu_error.stop_rings) { + DRM_INFO("Simulated gpu hang, reset stop_rings bits %08x\n", + (0x1 << engine->id)); + dev_priv->gpu_error.stop_rings &= ~(0x1 << engine->id); + /* if all hangs are cleared, then clear the ALLOW_BAN/ERROR bits */ + if ((dev_priv->gpu_error.stop_rings & ((1 << I915_NUM_RINGS) - 1)) == 0) + dev_priv->gpu_error.stop_rings = 0; + } + /* Sample the current ring head position */ head = I915_READ_HEAD(engine) & HEAD_ADDR; diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 96f5a28..284baab 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -641,6 +641,16 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_res } } + /* Check for a simulated hang request */ + if (intel_ring_stopped(ring)) { + /* + * Mark the request at the head of the queue as submitted but + * dont actually submit it. + */ + req0->elsp_submitted++; + return; + } + WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission); execlists_submit_requests(req0, req1, tdr_resubmission); @@ -1006,16 +1016,12 @@ static int logical_ring_wait_for_space(struct drm_i915_gem_request *req, static void intel_logical_ring_advance_and_submit(struct drm_i915_gem_request *request) { - struct intel_engine_cs *ring = request->ring; struct drm_i915_private *dev_priv = request->i915; intel_logical_ring_advance(request->ringbuf); request->tail = request->ringbuf->tail; - if (intel_ring_stopped(ring)) - return; - if (dev_priv->guc.execbuf_client) i915_guc_submit(dev_priv->guc.execbuf_client, request); else @@ -3250,7 +3256,17 @@ intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring, } if (tmpctx) { - status = ((hw_context == sw_context) && hw_active) ? + /* + * Check for simuated hang. In this case the head entry in the + * sw execlist queue will not have been submitted to the ELSP, so + * the hw and sw context id's may well disagree, but we still want + * to proceed with hang recovery. So we return OK which allows + * the TDR recovery mechanism to proceed with a ring reset. + */ + if (intel_ring_stopped(ring)) + status = CONTEXT_SUBMISSION_STATUS_OK; + else + status = ((hw_context == sw_context) && hw_active) ? CONTEXT_SUBMISSION_STATUS_OK : CONTEXT_SUBMISSION_STATUS_INCONSISTENT; } else { -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx