Re: [PATCH 2/5] drm/i915/gt: Push engine stopping into reset-prepare

On 17/07/2019 14:30, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-07-17 14:21:50)

On 17/07/2019 14:08, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-07-17 14:04:34)

On 16/07/2019 13:49, Chris Wilson wrote:
Push the engine stop into the backend reset_prepare (where it already was!)
This allows us to avoid dangerously setting the RING registers to 0 for
logical contexts. If we clear the register on a live context, those
invalid register values are recorded in the logical context state and
replayed (with hilarious results).

So essentially the statement is that gen3_stop_engine is not needed, and
is even dangerous, with execlists?

Yes. It has been a nuisance in the past, which is why we try to avoid
it. I have come to the conclusion that it serves no purpose for execlists
and only makes recovery worse.
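
The hazard described in the commit message can be pictured with a toy model. This is only a sketch under loud assumptions: every type and helper below (toy_engine, toy_context_save, and so on) is hypothetical and is not i915 code. It assumes only what the commit message states, namely that register values present on a live context are recorded in the logical context state and replayed later.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: hypothetical types, not the i915 driver's structures. */
struct toy_ring_regs {
	uint32_t head;
	uint32_t tail;
	uint32_t ctl;
};

struct toy_engine {
	struct toy_ring_regs hw;    /* live register state */
	struct toy_ring_regs image; /* saved logical context image */
	int context_live;           /* nonzero while a context is loaded */
};

/* Context save: the live registers are recorded into the context image. */
static void toy_context_save(struct toy_engine *e)
{
	if (e->context_live)
		e->image = e->hw;
}

/* Context restore: the recorded image is replayed into the registers. */
static void toy_context_restore(struct toy_engine *e)
{
	e->hw = e->image;
	e->context_live = 1;
}

/* Models the removed clearing of HEAD, TAIL and CTL to 0. */
static void toy_clear_ring(struct toy_engine *e)
{
	e->hw.head = 0;
	e->hw.tail = 0;
	e->hw.ctl = 0;
}

/*
 * Stop a live context (optionally clearing the ring registers first),
 * save it, reset the hardware, then restore the context.
 * Returns 1 if the replayed ring control value is the invalid 0.
 */
static int toy_replay_after_clear(int clear_ring)
{
	struct toy_engine e = {
		.hw = { .head = 0x100, .tail = 0x180, .ctl = 0x3001 },
		.context_live = 1,
	};

	assert(e.context_live);
	if (clear_ring)
		toy_clear_ring(&e);
	toy_context_save(&e);

	/* The reset wipes the live registers; the image survives. */
	e.context_live = 0;
	e.hw = (struct toy_ring_regs){ 0 };

	toy_context_restore(&e);
	return e.hw.ctl == 0;
}
```

In this model, clearing the registers while the context is live is what poisons the saved image; skipping the clear leaves the original values to be replayed.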



Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
    drivers/gpu/drm/i915/gt/intel_lrc.c        | 16 +++++-
    drivers/gpu/drm/i915/gt/intel_reset.c      | 58 ----------------------
    drivers/gpu/drm/i915/gt/intel_ringbuffer.c | 40 ++++++++++++++-
    3 files changed, 53 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 9e0992498087..9b87a2fc186c 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2173,11 +2173,23 @@ static void execlists_reset_prepare(struct intel_engine_cs *engine)
        __tasklet_disable_sync_once(&execlists->tasklet);
        GEM_BUG_ON(!reset_in_progress(execlists));
- intel_engine_stop_cs(engine);
-
        /* And flush any current direct submission. */
        spin_lock_irqsave(&engine->active.lock, flags);
        spin_unlock_irqrestore(&engine->active.lock, flags);
+
+     /*
+      * We stop engines, otherwise we might get failed reset and a
+      * dead gpu (on elk). Also as modern gpu as kbl can suffer
+      * from system hang if batchbuffer is progressing when
+      * the reset is issued, regardless of READY_TO_RESET ack.
+      * Thus assume it is best to stop engines on all gens
+      * where we have a gpu reset.
+      *
+      * WaKBLVECSSemaphoreWaitPoll:kbl (on ALL_ENGINES)
+      *
+      * FIXME: Wa for more modern gens needs to be validated
+      */
+     intel_engine_stop_cs(engine);
    }
static void reset_csb_pointers(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 7ddedfb16aa2..55e2ddcbd215 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -135,47 +135,6 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
        }
    }
-static void gen3_stop_engine(struct intel_engine_cs *engine)
-{
-     struct intel_uncore *uncore = engine->uncore;
-     const u32 base = engine->mmio_base;
-
-     GEM_TRACE("%s\n", engine->name);
-
-     if (intel_engine_stop_cs(engine))
-             GEM_TRACE("%s: timed out on STOP_RING\n", engine->name);
-
-     intel_uncore_write_fw(uncore,
-                           RING_HEAD(base),
-                           intel_uncore_read_fw(uncore, RING_TAIL(base)));
-     intel_uncore_posting_read_fw(uncore, RING_HEAD(base)); /* paranoia */
-
-     intel_uncore_write_fw(uncore, RING_HEAD(base), 0);
-     intel_uncore_write_fw(uncore, RING_TAIL(base), 0);
-     intel_uncore_posting_read_fw(uncore, RING_TAIL(base));
-
-     /* The ring must be empty before it is disabled */
-     intel_uncore_write_fw(uncore, RING_CTL(base), 0);
-
-     /* Check acts as a post */
-     if (intel_uncore_read_fw(uncore, RING_HEAD(base)))
-             GEM_TRACE("%s: ring head [%x] not parked\n",
-                       engine->name,
-                       intel_uncore_read_fw(uncore, RING_HEAD(base)));
-}
-
-static void stop_engines(struct intel_gt *gt, intel_engine_mask_t engine_mask)
-{
-     struct intel_engine_cs *engine;
-     intel_engine_mask_t tmp;
-
-     if (INTEL_GEN(gt->i915) < 3)
-             return;
-
-     for_each_engine_masked(engine, gt->i915, engine_mask, tmp)
-             gen3_stop_engine(engine);
-}
-
    static bool i915_in_reset(struct pci_dev *pdev)
    {
        u8 gdrst;
@@ -607,23 +566,6 @@ int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask)
         */
        intel_uncore_forcewake_get(gt->uncore, FORCEWAKE_ALL);
        for (retry = 0; ret == -ETIMEDOUT && retry < retries; retry++) {
-             /*
-              * We stop engines, otherwise we might get failed reset and a
-              * dead gpu (on elk). Also as modern gpu as kbl can suffer
-              * from system hang if batchbuffer is progressing when
-              * the reset is issued, regardless of READY_TO_RESET ack.
-              * Thus assume it is best to stop engines on all gens
-              * where we have a gpu reset.
-              *
-              * WaKBLVECSSemaphoreWaitPoll:kbl (on ALL_ENGINES)
-              *
-              * WaMediaResetMainRingCleanup:ctg,elk (presumably)
-              *
-              * FIXME: Wa for more modern gens needs to be validated
-              */
-             if (retry)
-                     stop_engines(gt, engine_mask);
-

The only other functional change I see is that we no longer retry stopping
the engines before each reset attempt. I don't know if that is a concern or not.

Ah, but we do stop the engine before resets in *reset_prepare. The other
path by which we arrive here is sanitize, where we don't know enough state
to safely tweak the engines. For those, I claim it shouldn't matter as the
engines should be idle and we only need the reset to clear stale context state.

Yes, I know that we do call stop in prepare, just not on the reset retry
path. It's the loop above: if the reset was failing and needed retries,
we would previously retry stopping the engines, and now we would not.

The engines are still stopped. The functional change is to remove the
dangerous clearing of RING_HEAD/CTL.

Okay for execlists, but for ringbuffer I was simply asking whether _one_ of the reasons for a failed reset could be a failure to stop the CS. In that case, removing stop_engines from the retry loop might be detrimental for ringbuffer.
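
The scenario being asked about can be sketched as a toy model. Everything here is an assumption made for illustration: the names (toy_hw, toy_stop_engine, toy_reset) are hypothetical, and the premise that a reset fails only because the CS has not yet halted is exactly the open question, not an established fact about the hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical hardware, not the i915 driver. */
struct toy_hw {
	int stops_needed; /* stop requests needed before the CS halts */
	bool stopped;
};

/* Each stop request brings the toy CS one step closer to halting. */
static void toy_stop_engine(struct toy_hw *hw)
{
	if (hw->stops_needed > 0)
		hw->stops_needed--;
	hw->stopped = (hw->stops_needed == 0);
}

/* Assumption: the reset succeeds only once the CS has halted. */
static int toy_do_reset(const struct toy_hw *hw)
{
	return hw->stopped ? 0 : -1;
}

/*
 * One stop in "prepare", then up to `retries` reset attempts.
 * stop_on_retry = true models the old loop (stop again on each retry),
 * stop_on_retry = false models the loop after this patch.
 */
static int toy_reset(struct toy_hw *hw, int retries, bool stop_on_retry)
{
	int ret = -1;
	int retry;

	toy_stop_engine(hw);
	for (retry = 0; ret != 0 && retry < retries; retry++) {
		if (retry && stop_on_retry)
			toy_stop_engine(hw);
		ret = toy_do_reset(hw);
	}
	return ret;
}
```

Under that assumption, hardware that needs a second stop request recovers only when the loop stops the engine again on retry. If the first stop always suffices, the two variants behave identically.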

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



