Quoting Michel Thierry (2017-10-30 18:56:15) > This patch adds per engine reset and recovery (TDR) support when GuC is > used to submit workloads to GPU. > > In the case of i915 directly submission to ELSP, driver manages hang > detection, recovery and resubmission. With GuC submission these tasks > are shared between driver and GuC. i915 is still responsible for detecting > a hang, and when it does it only requests GuC to reset that Engine. GuC > internally manages acquiring forcewake and idling the engine before actually > resetting it. > > Once the reset is successful, i915 takes over again and handles resubmission. > The scheduler in i915 knows which requests are pending so after resetting > a engine, pending workloads/requests are resubmitted again. > > v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the > non-guc funtion names. > > v3: Removed debug message about engine restarting from which request, > since the new baseline do it regardless of submission mode. (Chris) > > v4: Rebase. > > v5: Do not pass unnecessary reporting flags to the fw (Jeff); > tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase. In your experience, how did our test coverage fare? Could you use live_hangcheck effectively? (The drv_selftest would need some hand holding to pass along guc options. But for livetesting we should probably get to the point of being able to load/unload the guc interface so that we cover both execlists and guc.) Did you find "gem_exec_whisper --r hang*", did you try gem_concurrent_all? > +/** > + * intel_guc_reset_engine() - ask GuC to reset an engine > + * @engine: engine to be reset > + */ > +int intel_guc_reset_engine(struct intel_engine_cs *engine) > +{ > + struct drm_i915_private *dev_priv = engine->i915; > + struct intel_guc *guc = &dev_priv->guc; > + u32 data[7]; > + > + GEM_BUG_ON(!guc->execbuf_client); > + > + data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET; > + data[1] = engine->guc_id; > + data[2] = 0; > + data[3] = 0; > + data[4] = 0; > + data[5] = guc->execbuf_client->stage_id; > + data[6] = guc_ggtt_offset(guc->shared_data); > + > + return intel_guc_send(guc, data, ARRAY_SIZE(data)); Is this a synchronous action? We expect that following the completion of the reset routine, we are ready to reinit the hw. The same rule needs to apply the guc, I think. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx