During repeated wedged-unwedged, it is found that i915_request_retire zaps the old request with 0x6b6b6b6b. On unwedged, we write a new request at RING_TAIL, expecting to start executuing from that position, but execution resumes from RING_HEAD (preserved from an earlier wakeup before wedging) and consumes the 0x6b. Resetting kernel/user context setup enables RING_HEAD to use RING_TAIL for submitting new requests which resolves issue. Normally this reset is applied when unpinning a user context, or for kernel_contexts upon waking up the device. But fast wedged-unwedged sequence will keep the device awake, preserving the RING_HEAD from before. Testcase: igt@gem_eio@unwedge-stress Note : Current user impact is assessed to be low, as this only affects intel_gt_unset_wedged which is currently only used during testing and upon suspend resume (where the device was already flushed and will reset the kernel_contexts on waking up). In the future though, this will present an issue for PCI error recovery. Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@xxxxxxxxx> --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 8 ++++++++ drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++ 2 files changed, 11 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 352254e001b4..7e1c561bce69 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1455,9 +1455,17 @@ void intel_engines_reset_default_submission(struct intel_gt *gt) enum intel_engine_id id; for_each_engine(engine, gt, id) { + struct intel_context *ce = engine->kernel_context; + if (engine->sanitize) engine->sanitize(engine); + /* Reset RING_HEAD so we don't consume the old + * poisoned request on unwedging + */ + if (ce) + ce->ops->reset(ce); + engine->set_default_submission(engine); } } diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 7be0002d9d70..1c26e936e699 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -961,6 +961,9 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt) } spin_unlock(&timelines->lock); + /* Ensure that all non-kernel contexts are unpinned as well */ + intel_gt_retire_requests(gt); + /* We must reset pending GPU events before restoring our submission */ ok = !HAS_EXECLISTS(gt->i915); /* XXX better agnosticism desired */ if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) -- 2.31.1