Re: [PATCH 2/3] drm/i915/guc: Add support for reset engine using GuC commands

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Mon, 30 Oct 2017 21:09:04 +0000

Quoting Michel Thierry (2017-10-30 18:56:15)
> This patch adds per engine reset and recovery (TDR) support when GuC is
> used to submit workloads to GPU.
> 
> In the case of i915 directly submission to ELSP, driver manages hang
> detection, recovery and resubmission. With GuC submission these tasks
> are shared between driver and GuC. i915 is still responsible for detecting
> a hang, and when it does it only requests GuC to reset that Engine. GuC
> internally manages acquiring forcewake and idling the engine before actually
> resetting it.
> 
> Once the reset is successful, i915 takes over again and handles resubmission.
> The scheduler in i915 knows which requests are pending so after resetting
> a engine, pending workloads/requests are resubmitted again.
> 
> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
> non-guc funtion names.
> 
> v3: Removed debug message about engine restarting from which request,
> since the new baseline do it regardless of submission mode. (Chris)
> 
> v4: Rebase.
> 
> v5: Do not pass unnecessary reporting flags to the fw (Jeff);
> tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.

In your experience, how did our test coverage fare?

Could you use live_hangcheck effectively? (The drv_selftest would need
some hand holding to pass along guc options. But for livetesting we
should probably get to the point of being able to load/unload the guc
interface so that we cover both execlists and guc.) Did you find 
"gem_exec_whisper --r hang*", did you try gem_concurrent_all?

> +/**
> + * intel_guc_reset_engine() - ask GuC to reset an engine
> + * @engine:    engine to be reset
> + */
> +int intel_guc_reset_engine(struct intel_engine_cs *engine)
> +{
> +       struct drm_i915_private *dev_priv = engine->i915;
> +       struct intel_guc *guc = &dev_priv->guc;
> +       u32 data[7];
> +
> +       GEM_BUG_ON(!guc->execbuf_client);
> +
> +       data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
> +       data[1] = engine->guc_id;
> +       data[2] = 0;
> +       data[3] = 0;
> +       data[4] = 0;
> +       data[5] = guc->execbuf_client->stage_id;
> +       data[6] = guc_ggtt_offset(guc->shared_data);
> +
> +       return intel_guc_send(guc, data, ARRAY_SIZE(data));

Is this a synchronous action? We expect that following the completion of
the reset routine, we are ready to reinit the hw. The same rule needs to
apply the guc, I think.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx