On 30/10/17 14:09, Chris Wilson wrote:
Quoting Michel Thierry (2017-10-30 18:56:15)
This patch adds per engine reset and recovery (TDR) support when GuC is
used to submit workloads to GPU.
In the case of i915 directly submission to ELSP, driver manages hang
detection, recovery and resubmission. With GuC submission these tasks
are shared between driver and GuC. i915 is still responsible for detecting
a hang, and when it does it only requests GuC to reset that Engine. GuC
internally manages acquiring forcewake and idling the engine before actually
resetting it.
Once the reset is successful, i915 takes over again and handles resubmission.
The scheduler in i915 knows which requests are pending so after resetting
a engine, pending workloads/requests are resubmitted again.
v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
non-guc funtion names.
v3: Removed debug message about engine restarting from which request,
since the new baseline do it regardless of submission mode. (Chris)
v4: Rebase.
v5: Do not pass unnecessary reporting flags to the fw (Jeff);
tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.
In your experience, how did our test coverage fare?
Could you use live_hangcheck effectively? (The drv_selftest would need
some hand holding to pass along guc options. But for livetesting we
should probably get to the point of being able to load/unload the guc
interface so that we cover both execlists and guc.) Did you find
"gem_exec_whisper --r hang*", did you try gem_concurrent_all?
live_hangcheck runs ok with guc (as long as i915_params.h has guc
submission enabled). Do you see a benefit on adding an option in
drv_selftest to override the submission mode? I can add it to my list.
You got me in gem_concurrent_all, I forgot to schedule it a few weeks ago.
+/**
+ * intel_guc_reset_engine() - ask GuC to reset an engine
+ * @engine: engine to be reset
+ */
+int intel_guc_reset_engine(struct intel_engine_cs *engine)
+{
+ struct drm_i915_private *dev_priv = engine->i915;
+ struct intel_guc *guc = &dev_priv->guc;
+ u32 data[7];
+
+ GEM_BUG_ON(!guc->execbuf_client);
+
+ data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
+ data[1] = engine->guc_id;
+ data[2] = 0;
+ data[3] = 0;
+ data[4] = 0;
+ data[5] = guc->execbuf_client->stage_id;
+ data[6] = guc_ggtt_offset(guc->shared_data);
+
+ return intel_guc_send(guc, data, ARRAY_SIZE(data));
Is this a synchronous action? We expect that following the completion of
the reset routine, we are ready to reinit the hw. The same rule needs to
apply the guc, I think.
Right now the action is synchronous, the fw won't reply to the action
until all the steps are completed. It also is fast enough, I haven't
seen it time out (which would be promoted to full reset and reload the
fw). But, do you have a crystal ball?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx