Re: [PATCH] drm/i915/guc: Propagate the fw xfer timeout

"Michal Wajdeczko" <michal.wajdeczko@xxxxxxxxx> · Thu, 18 Oct 2018 20:27:20 +0200

On Thu, 18 Oct 2018 20:18:53 +0200, Daniele Ceraolo Spurio  
<daniele.ceraolospurio@xxxxxxxxx> wrote:



On 18/10/18 02:13, Chris Wilson wrote:
Quoting Michal Wajdeczko (2018-10-18 00:22:43)
On Thu, 18 Oct 2018 01:09:19 +0200, Daniele Ceraolo Spurio
<daniele.ceraolospurio@xxxxxxxxx> wrote:



On 17/10/18 13:29, Chris Wilson wrote:
Propagate the timeout on transferring the fw back to the caller  
where it
may act upon it, usually by restarting the xfer before failing.


Did you see any case where we failed the xfer and didn't get a timeout
out of guc_wait_ucode? that'd be quite weird
 Yes the logs show the xfer error but not the wait error. So we ended up
returning 0.


The guc status register is correctly cleared by HW on guc reset (just  
checked that), so if the wait_ucode succeeded in matching a "ready"  
value in there it means that the guc FW had loaded correctly. Maybe it  
managed to complete the xfer just after the timeout elapsed or the DMA  
got confused and reported the wrong status. Still weird, but your  
changes should help make the error more visible.

maybe first timeout in DMA transfer was mitigated by additional wait in
guc_wait_ucode() - so maybe this error propagation is not that important ?


Anyway, definitely better and cleaner than before:
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>

small nitpick below

Testcase: igt/drv_selftest/live_hangcheck
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Michal Wajdeczko <michal.wajdeczko@xxxxxxxxx>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>
---
   drivers/gpu/drm/i915/intel_guc_fw.c | 23 ++++++-----------------
   1 file changed, 6 insertions(+), 17 deletions(-)
  diff --git a/drivers/gpu/drm/i915/intel_guc_fw.c
b/drivers/gpu/drm/i915/intel_guc_fw.c
index a9e6fcce467c..b68a05892dab 100644

--- a/drivers/gpu/drm/i915/intel_guc_fw.c
+++ b/drivers/gpu/drm/i915/intel_guc_fw.c
@@ -125,22 +125,17 @@ static void guc_prepare_xfer(struct intel_guc
*guc)
   }
     /* Copy RSA signature from the fw image to HW for verification  
*/
-static int guc_xfer_rsa(struct intel_guc *guc, struct i915_vma *vma)
+static void guc_xfer_rsa(struct intel_guc *guc, struct i915_vma  
*vma)
   {
      struct drm_i915_private *dev_priv = guc_to_i915(guc);
-    struct intel_uc_fw *guc_fw = &guc->fw;
-    struct sg_table *sg = vma->pages;
      u32 rsa[UOS_RSA_SCRATCH_COUNT];
      int i;
   -  if (sg_pcopy_to_buffer(sg->sgl, sg->nents, rsa, sizeof(rsa),
-                           guc_fw->rsa_offset) != sizeof(rsa))
-            return -EINVAL;
+    sg_pcopy_to_buffer(vma->pages->sgl, vma->pages->nents,
+                       rsa, sizeof(rsa), guc->fw.rsa_offset);
      for (i = 0; i < UOS_RSA_SCRATCH_COUNT; i++)
              I915_WRITE(UOS_RSA_SCRATCH(i), rsa[i]);
-
-    return 0;
   }
     /*
@@ -251,17 +246,11 @@ static int guc_fw_xfer(struct intel_uc_fw
*guc_fw, struct i915_vma *vma)
       * by the DMA engine in one operation, whereas the RSA  
signature is
       * loaded via MMIO.
       */
-    ret = guc_xfer_rsa(guc, vma);
-    if (ret)
-            DRM_WARN("GuC firmware signature xfer error %d\n", ret);
+    guc_xfer_rsa(guc, vma);
      ret = guc_xfer_ucode(guc, vma);
-    if (ret)
-            DRM_WARN("GuC firmware code xfer error %d\n", ret);
-
-    ret = guc_wait_ucode(guc);
-    if (ret)
-            DRM_ERROR("GuC firmware xfer error %d\n", ret);

With the logs gone we don't have any more error message to understand
where exactly we hit issue (xfer vs wait_ucode). We do have debug logs
printing the status that would give the info, but might be worth
retaining at least 1 error-level log.

+1 for leaving error-level logs (here or at helper functions)
also, please let CI re-run this patch with GuC enabled ;)
 I disagree since the errors are noise as the error is propagated and
handled by the caller -- so not strictly errors, but maybe
DRM_DEBUG_DRIVER().
-Chris


True, we don't want an error message if the caller decides that a  
failure can be expected and retries. we already have DRM_DEBUG_DRIVER  
printing status inside the GuC wait and GuC xfer functions, those should  
be enough.

Daniele
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx