On 2/22/2023 1:01 PM, Alan Previn wrote:
The Driver-FLR flow may inadvertently exit early before the full
completion of the re-init of the internal HW state if we only poll
GU_DEBUG Bit31 (polling for it to toggle from 0 -> 1). Instead
we need a two-step completion wait-for-completion flow that also
involves GU_CNTL. See the patch and new code comments for detail.
This is new direction from HW architecture folks.
v2: - Add error message for the teardown timeout (Anshuman)
- Don't duplicate code in comments (Jani)
Signed-off-by: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
Fixes: 5a44fcd73498 ("drm/i915/gsc: Do a driver-FLR on unload if GSC was loaded")
I'm not sure if we need a fixes tag, given that this is MTL specific
code and that's still under force probe.
---
drivers/gpu/drm/i915/intel_uncore.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index f018da7ebaac..f3c46352db89 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2749,14 +2749,25 @@ static void driver_initiated_flr(struct intel_uncore *uncore)
/* Trigger the actual Driver-FLR */
intel_uncore_rmw_fw(uncore, GU_CNTL, 0, DRIVERFLR);
+ /* Wait for hardware teardown to complete */
+ ret = intel_wait_for_register_fw(uncore, GU_CNTL,
+ DRIVERFLR_STATUS, 0,
shouldn't this bit be DRIVERFLR instead of DRIVERFLR_STATUS ? I know
they're both BIT(31), but DRIVERFLR_STATUS is the definition for the
GU_DEBUG bit, while this wait is on GU_CNTL.
+ flr_timeout_ms);
+ if (ret) {
+ drm_err(&i915->drm, "Driver-FLR-teardown wait completion failed! %d\n", ret);
+ return;
+ }
+
+ /* Wait for hardware/firmware re-init to complete */
ret = intel_wait_for_register_fw(uncore, GU_DEBUG,
DRIVERFLR_STATUS, DRIVERFLR_STATUS,
flr_timeout_ms);
I was wondering if we could reduce the timing here to avoid 2 waits of 3
seconds, as the 3 seconds should be for the full process. However, the
specs don't say how much each step can take, so I agree that to be safe
is better to have both timeouts at 3 seconds. If the FLR fails the HW is
toast anyway, so waiting a few seconds more to detect it on driver
unload is not going to have additional consequences that we wouldn't
already have.
With the bit in the wait above fixed:
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>
Daniele
if (ret) {
- drm_err(&i915->drm, "wait for Driver-FLR completion failed! %d\n", ret);
+ drm_err(&i915->drm, "Driver-FLR-reinit wait completion failed! %d\n", ret);
return;
}
+ /* Clear sticky completion status */
intel_uncore_write_fw(uncore, GU_DEBUG, DRIVERFLR_STATUS);
}