On 3/2/2023 1:50 PM, Zhanjun Dong wrote:
Add err code check for enable_communication on resume path. When resume failed, we can no longer use the GPU, marking the GPU as wedged.
This is slightly incorrect. If we fail to enable communication, the
consequence is that we can't use the GuC. That translates to a failure
to use the GT only if GuC submission is enabled; in execlists submission
mode we can keep going, although we might end up losing HuC
functionality (which is not considered a fatal error). Therefore, the
code below should be updated to reflect this.
Signed-off-by: Zhanjun Dong <zhanjun.dong@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 ++++++-
drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++++++--
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index cef3d6f5c34e..42a7d75ce39e 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -401,8 +401,13 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
intel_ggtt_restore_fences(gt->ggtt);
ret = intel_uc_runtime_resume(>->uc);
- if (ret)
+ if (ret) {
+ /* Resume failed, we can no longer use the GPU, marking the GPU
+ * as wedged.
+ */
+ intel_gt_set_wedged(gt);
intel_gt_set_wedged calls intel_runtime_pm_get, so it will deadlock if
you call if from within the runtime_resume flow.
return ret;
+ }
return 0;
}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 6648691bd645..d4f428acf20a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool enable_communication)
/* Make sure we enable communication if and only if it's disabled */
GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(&guc->ct));
- if (enable_communication)
- guc_enable_communication(guc);
+ if (enable_communication) {
+ err = guc_enable_communication(guc);
+ if (err) {
+ guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));
nit: this isn't exactly a failure to resume because the GuC is running.
It's more a failure to re-establish communication with the GuC.
Daniele
+ return err;
+ }
+ }
/* If we are only resuming GuC communication but not reloading
* GuC, we need to ensure the ARAT timer interrupt is enabled