Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 19/12/16 01:51, Chris Wilson wrote:
On Sun, Dec 18, 2016 at 09:02:33PM -0800, Michel Thierry wrote:


On 12/16/2016 12:45 PM, Chris Wilson wrote:
On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
From: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
>from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
- identifies the request that caused the hang and it is dropped
- force engine to idle: this is done by issuing a reset request
- reset and re-init engine
- restart submissions to the engine

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

v2: Rebase.

Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx>
Signed-off-by: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>
Signed-off-by: Michel Thierry <michel.thierry@xxxxxxxxx>
---
drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
drivers/gpu/drm/i915/i915_drv.h     |  3 ++
drivers/gpu/drm/i915/i915_gem.c     |  2 +-
drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
drivers/gpu/drm/i915/intel_lrc.h    |  1 +
drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
6 files changed, 108 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e5688edd62cd..a034793bc246 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
 *
 * Reset a specific GPU engine. Useful if a hang is detected.
 * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is fairly simple:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
 */
int i915_reset_engine(struct intel_engine_cs *engine)

What's the serialisation between potential callers of
i915_reset_engine()?


I haven't seen simultaneous calls happening yet, would a
reset_engine-specific mutex be enough?

That is what it more or less boils down to. But first, do we ensure we
don't declare a second hang on this engine before the first reset is
complete? Then we only a barrier between a single engine reset and a
full reset.

{
	int ret;
	struct drm_i915_private *dev_priv = engine->i915;

-	/* FIXME: replace me with engine reset sequence */
-	ret = -ENODEV;
+	/*
+	 * We need to first idle the engine by issuing a reset request,
+	 * then perform soft reset and re-initialize hw state, for all of
+	 * this GT power need to be awake so ensure it does throughout the
+	 * process
+	 */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	/*
+	 * the request that caused the hang is stuck on elsp, identify the
+	 * active request and drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine);

Must freeze the engine and irqs first, before calling
i915_gem_reset_engine() (i.e. something like disable_engines_irq,
cancelling tasklet)

Will do.

Eeek note that the current i915_gem_reset_engine() is lacking a
spinlock.


The new mutex (for i915_reset_engine) should cover this.

No. We need to protect these lists from concurrent manipulation in the
fences. The spinlock protection there needs to be extended here, being
covered by struct_mutex is no longer sufficient protection for the
current code.

I think you already addressed the last point in "drm/i915: Prevent timeline updates whilst performing reset" (00c25e3f40083a6d5f1111955baccd287ee49258), right?

But looking at these (and the interaction with possible waiters), do you think it is still achievable to reset a engine without grabbing the struct_mutex? Then both i915_reset (chip) and i915_reset_engine can use the "reset_in_progress".

Thanks
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux