As a final paranoid step (we _should_ have reset the GPU on suspending
the device prior to unload), reset the GPU once more before removing the
power context and other related power-saving paraphernalia.

A clue that this may not be the case is

<7> [313.203721] __intel_gt_set_wedged rcs'0
<7> [313.203746] __intel_gt_set_wedged Awake? 3
<7> [313.203751] __intel_gt_set_wedged Barriers?: no
<7> [313.203756] __intel_gt_set_wedged Latency: 0us
<7> [313.203762] __intel_gt_set_wedged Reset count: 0 (global 0)
<7> [313.203766] __intel_gt_set_wedged Requests:
<7> [313.203785] __intel_gt_set_wedged MMIO base: 0x00002000
<7> [313.203819] __intel_gt_set_wedged RING_START: 0x00000000
<7> [313.203826] __intel_gt_set_wedged RING_HEAD: 0x00000000
<7> [313.203833] __intel_gt_set_wedged RING_TAIL: 0x00000000
<7> [313.203844] __intel_gt_set_wedged RING_CTL: 0x00000000
<7> [313.203854] __intel_gt_set_wedged RING_MODE: 0x00000000
<7> [313.203861] __intel_gt_set_wedged RING_IMR: fffffefe
<7> [313.203875] __intel_gt_set_wedged ACTHD: 0x00000000_00000000
<7> [313.203888] __intel_gt_set_wedged BBADDR: 0x00000000_00000000
<7> [313.203901] __intel_gt_set_wedged DMA_FADDR: 0x00000000_00000000
<7> [313.203909] __intel_gt_set_wedged IPEIR: 0x00000000
<7> [313.203916] __intel_gt_set_wedged IPEHR: 0xcccccccc
<7> [313.203921] __intel_gt_set_wedged Execlist tasklet queued? no (enabled), preempt? inactive, timeslice? inactive
<7> [313.203932] __intel_gt_set_wedged Execlist status: 0x00044032 00000020; CSB read:5, write:0, entries:6
<7> [313.203937] __intel_gt_set_wedged Execlist CSB[0]: 0x00000001, context: 0
<7> [313.203952] __intel_gt_set_wedged Pending[0] ring:{start:000c4000, hwsp:fedfc000, seqno:00000000}, rq: 402e:2- prio=2147483647 @ 207ms: [i915]
<7> [313.203983] __intel_gt_set_wedged E 402e:2- prio=2147483647 @ 207ms: [i915]
<7> [313.204006] __intel_gt_set_wedged Queue priority hint: 3

during rapid fault-injection reloads.

0xcc is POISON_FREE_INITMEM, which suggests that the system cleared the
pages on initialisation while they were still being used from the
previous module load. That is despite a couple of GPU resets prior to
this... I have a sneaking suspicion that it may be a GuC artifact.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Andi Shyti <andi.shyti@xxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

drm/i915/gt: Lift clearing GT wedged out of gt_sanitize

We only want to try and reset a wedged device on resume, not before
suspend, so lift the recovery out of the common gt_sanitize().

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Andi Shyti <andi.shyti@xxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index d1c2f034296a..c039185c4bd2 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -138,17 +138,6 @@ static void gt_sanitize(struct intel_gt *gt, bool force)
 	wakeref = intel_runtime_pm_get(gt->uncore->rpm);
 	intel_uncore_forcewake_get(gt->uncore, FORCEWAKE_ALL);
 
-	/*
-	 * As we have just resumed the machine and woken the device up from
-	 * deep PCI sleep (presumably D3_cold), assume the HW has been reset
-	 * back to defaults, recovering from whatever wedged state we left it
-	 * in and so worth trying to use the device once more.
-	 */
-	if (intel_gt_is_wedged(gt))
-		intel_gt_unset_wedged(gt);
-
-	intel_uc_sanitize(&gt->uc);
-
 	for_each_engine(engine, gt, id)
 		if (engine->reset.prepare)
 			engine->reset.prepare(engine);
@@ -170,6 +159,7 @@ static void gt_sanitize(struct intel_gt *gt, bool force)
 
 void intel_gt_pm_fini(struct intel_gt *gt)
 {
+	intel_gt_set_wedged(gt);
 	intel_rc6_fini(&gt->rc6);
 }
 
@@ -194,7 +184,19 @@ int intel_gt_resume(struct intel_gt *gt)
 	intel_gt_pm_get(gt);
 
 	intel_uncore_forcewake_get(gt->uncore, FORCEWAKE_ALL);
+
 	intel_rc6_sanitize(&gt->rc6);
+	intel_uc_sanitize(&gt->uc);
+
+	/*
+	 * As we have just resumed the machine and woken the device up from
+	 * deep PCI sleep (presumably D3_cold), assume the HW has been reset
+	 * back to defaults, recovering from whatever wedged state we left it
+	 * in and so worth trying to use the device once more.
+	 */
+	if (intel_gt_is_wedged(gt))
+		intel_gt_unset_wedged(gt);
+
 	gt_sanitize(gt, true);
 	if (intel_gt_is_wedged(gt)) {
 		err = -EIO;
@@ -308,8 +310,7 @@ void intel_gt_suspend_late(struct intel_gt *gt)
 		intel_llc_disable(&gt->llc);
 	}
 
-	gt_sanitize(gt, false);
-
+	intel_gt_set_wedged(gt);
 	GT_TRACE(gt, "\n");
 }
 
-- 
2.25.0.rc2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
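
For readers following the control flow, the ordering this patch aims for
can be sketched as a small userspace model. This is only an illustration
under stated assumptions: struct toy_gt, the toy_*() helpers and the
hw_healthy flag are hypothetical stand-ins invented here, not i915 code,
and the real intel_gt_set_wedged() and gt_sanitize() do far more than
flip a flag or bump a counter. The sketch shows late suspend wedging the
GT rather than sanitizing it, and resume first trying to clear the
wedged state, then sanitizing, then reporting -EIO if the GT is still
wedged.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct intel_gt; not the driver's type. */
struct toy_gt {
	bool wedged;     /* what intel_gt_is_wedged() would report */
	bool hw_healthy; /* assumption: can the HW recover on resume? */
	int resets;      /* number of gt_sanitize()-style resets issued */
};

/* Models intel_gt_set_wedged(): mark the GT unusable. */
static void toy_set_wedged(struct toy_gt *gt)
{
	gt->wedged = true;
}

/* Models intel_gt_unset_wedged(): recovery only succeeds on healthy HW. */
static void toy_unset_wedged(struct toy_gt *gt)
{
	if (gt->hw_healthy)
		gt->wedged = false;
}

/* Models gt_sanitize() after the patch: reset only, wedged is untouched. */
static void toy_sanitize(struct toy_gt *gt)
{
	gt->resets++;
}

/* Models intel_gt_suspend_late(): wedge instead of sanitizing. */
static void toy_suspend_late(struct toy_gt *gt)
{
	toy_set_wedged(gt);
}

/* Models intel_gt_resume(): try to recover, then sanitize, then check. */
static int toy_resume(struct toy_gt *gt)
{
	if (gt->wedged)
		toy_unset_wedged(gt);
	toy_sanitize(gt);
	return gt->wedged ? -EIO : 0;
}

/* Usage: one cycle on healthy hardware, then one on broken hardware. */
int toy_demo(void)
{
	struct toy_gt g = { .wedged = false, .hw_healthy = true, .resets = 0 };

	toy_suspend_late(&g);
	assert(g.wedged && g.resets == 0); /* no reset before suspend */
	assert(toy_resume(&g) == 0);       /* recovered on resume */

	g.hw_healthy = false;
	toy_suspend_late(&g);
	assert(toy_resume(&g) == -EIO);    /* still wedged after resume */
	return 0;
}
```

Keeping the recovery attempt in the resume path only, as the model does,
means the suspend path never clears a wedged state it is about to set
again.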