Hi John,
On 12/5/2023 8:50 PM, John Harrison wrote:
On 12/5/2023 02:39, Nirmoy Das wrote:
Hi John,
On 12/5/2023 10:10 AM, John Harrison wrote:
On 12/5/2023 00:52, Nirmoy Das wrote:
gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.
Convert the log to a trace log for debugging without triggering
unnecessary concerns in CI or for end-users during non-fatal
scenarios.
I strongly disagree with this change. The hardware spec for the
RESET_CTL and GDRST registers are that they will self clear within a
matter of microseconds. If something is so badly wrong with the
hardware that it can't even manage to reset
This message is for reset readiness poll timeout not that the reset
is failed which doesn't sound so serious if the subsequent attempt
managed reset the engine.
Not sure what the distinction is. The reset procedure is poke
RESET_CTL wait for it to clear, poke GDRST and wait for it to clear.
Just because step one is failing rather than step 2 does not mean that
the reset as a whole has not failed.
Note that the purpose of RESET_CTL is to pause a bunch of stuff like
the command streamers to prevent them from issuing new memory requests
while the reset is in progress. If it fails, it likely means that a CS
is refusing to stop. Most probably because it can't reach a stopping
point because it is stuck waiting on a lost memory request or similar.
And the point of stopping further memory requests during reset is that
if the memory channel gets out of sync (because only the GT side is
reset during a GT reset) then that can result in total system failure.
As in potentially even the CPU can no longer get to memory if it is an
integrated platform. So yes, it can be quite a serious failure indeed.
Thanks bspec didn't explain those details. My intention was to
acknowledge that engine reset is a complicated process which why the
driver retries and don't spook CI/user if subsequent reset works but I
get your objection on this.
I couldn't get enough details when this can happen that HW takes very
long time to set the readiness bit.
Is it simply 'taking a long time' or is never clearing at all? If it
is just that the timeout is too short then the proper fix would be to
increase the timeout. But if it is taking seconds or longer or just
never succeeding at all, then something is very bad.
I tried with 10x timeout without any help so I think the CS is stuck
though re-try works. I will try to get more details from HW team on this
issue.
then that is something that very much warrants more than a
completely silent trace event. It most certainly should be flagged
as a failure in CI.
Just because the driver will retry does not mean that this is not a
serious error. And if the first attempt failed, why would a
subsequent attempt succeed?
The patch is not ignoring the failure. If the subsequent attempt
fails then driver load will fail or it will be wedged if that happens
after driver load.
One thing I really hate about our driver is the total lack of
information when something goes wrong during load. The driver wedges
in total silence. There are many error paths that have no reporting at
all. Which means you are left with a totally useless bug report.
Escalating to FLR may have more success, but that is not something
that i915 currently does.
Do we still need to do FLR if a subsequent engine reset failure ?
Assuming that we are talking about modern(ish) platforms, an engine
reset failure would be hit by GuC rather than i915, but that would be
escalated to an i915 based full GT reset. Generally speaking though,
if the engine reset fails the GT reset isn't going to do much better.
It would fix a dead GuC problem but it can't help with memory related
issues. If the full GT reset fails then we are out of escalation
routes as there is no FLR path at present (I think we have that at
driver unload on MTL but not for general reset?). The FLR resets a lot
more than just the GT, so it does have a chance to fix some issues
that a GT reset can't. After driver-level FLR, there is PCI level FLR.
Not sure if that involves a full power down and restart, but if not
then that would be the last escalation possible. A power cycle really
should fix any issues, if it doesn't then it's time to return the
system as being totally dead!
My recollection is that the vast majority of engine reset failures
I've looked at have been completely catastrophic and the system only
recovered after a reboot. I.e. after the card was power cycled. Such
issues were generally caused by bad memory. Once the path to memory
has died, there really is not much of the GPU that can do anything at
all and there isn't much that can be done to recover it.
Thanks,
Nirmoy
John.
Regards,
Nirmoy
John.
v2: Improve commit message(Tvrtko)
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: John Harrison <John.C.Harrison@xxxxxxxxx>
Cc: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Cc: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das <nirmoy.das@xxxxxxxxx>
Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Reviewed-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
b/drivers/gpu/drm/i915/gt/intel_reset.c
index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct
intel_engine_cs *engine)
ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
700, 0, NULL);
if (ret)
- gt_err(engine->gt,
- "%s reset request timed out: {request: %08x,
RESET_CTL: %08x}\n",
- engine->name, request,
- intel_uncore_read_fw(uncore, reg));
+ GT_TRACE(engine->gt,
+ "%s reset request timed out: {request: %08x,
RESET_CTL: %08x}\n",
+ engine->name, request,
+ intel_uncore_read_fw(uncore, reg));
return ret;
}