On 9/27/2022 3:14 AM, Andrzej Hajda wrote:
On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
On 9/26/2022 3:44 PM, Andi Shyti wrote:
Hi Andrzej,
On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
Capturing error state is time consuming (up to 350ms on DG2), so it should
be avoided if possible. Context reset triggered by context removal is a
good example.
With this patch multiple igt tests will not time out and should run faster.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
Signed-off-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
fine for me:
Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Just to be on the safe side, can we also have the ack from any of
the GuC folks? Daniele, John?
Andi
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b01..cb58029208afe1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 	trace_intel_context_reset(ce);
 	if (likely(!intel_context_is_banned(ce))) {
-		capture_error_state(guc, ce);
+		if (!intel_context_is_exiting(ce))
+			capture_error_state(guc, ce);
 		guc_context_replay(ce);
You definitely don't want to replay requests of a context that is going away.
Without guc_context_replay I see timeouts, probably because
guc_context_replay calls __guc_reset_context. I am not sure if there is a
need to dig deeper, to stay with my initial proposition, or to do something like:
	if (likely(!intel_context_is_banned(ce))) {
		if (!intel_context_is_exiting(ce)) {
			capture_error_state(guc, ce);
			guc_context_replay(ce);
		} else {
			__guc_reset_context(ce, ce->engine->mask);
		}
	} else {
The latter is also working.
This seems to be an issue with the context close path when hangcheck is
disabled. In that case we don't call the revoke() helper, so we're not
clearing the context state in the GuC backend, and therefore we need
__guc_reset_context() in the reset handler to do so. I'd argue that the
proper solution would be to ban the context on close in the
hangcheck-disabled scenario and not just rely on the pulse, which, by the
way, I'm not sure works with GuC submission for a preemptible context,
because the GuC will just schedule the context back in unless we send an
H2G to explicitly disable it. I'm not sure why we're not banning right now,
though, so I'd prefer if someone knowledgeable could chime in in case there
is a good reason for it.
Daniele
Regards
Andrzej
This seems at least in part due to
https://patchwork.freedesktop.org/patch/487531/, where we replaced the
"context_ban" with "context_exiting". There are several places where we
skipped operations if the context was banned (here included) which are now
no longer covered for exiting contexts. Maybe we need a new checker
function that tests both flags, for places where we don't care why the
context is being removed (ban vs. exiting), only that it is?
Daniele
 	} else {
 		drm_info(&guc_to_gt(guc)->i915->drm,
--
2.34.1