On 10/11/2021 16:47, Matthew Brost wrote:
The hangcheck selftest blocks per engine resets by setting magic bits in
the reset flags. This is incorrect for GuC submission because if the GuC
fails to reset an engine we would like to do a full GT reset. Do no set
these magic bits when using GuC submission.
Side note this lockless algorithm with magic bits to block resets really
should be ripped out.
As a first step, I am seeing a pointless BUILD_BUG but no BUILD_BUG at
all for what really needs to be verified. Specifically, in
intel_gt_handle_error, inside the engine reset loop, there is:
BUILD_BUG_ON(I915_RESET_MODESET >=
I915_RESET_ENGINE);
Given that the above two values are explicit #defines of '1' and '2'.
I'm not seeing any value to this assert. On the other hand, what I am
not seeing anywhere is an assert that 'I915_RESET_ENGINE + max_engine_id
< I915_WEDGED_ON_INIT'. That being the thing that would actually go
horribly wrong if the engine count increased too far. Seems like there
should be one of those in intel_engines_init_mmio, using
ARRAY_SIZE(intel_engines) as the max id.
It looks like 'busy-reset' and 'reset-idle' parts of 'igt_ctx_sseu' in
gem/selftests/i915_gem_context.c also mess around with these flags and
then try to issue a manual engine reset. Presumably those tests are also
going to have issues with GuC submission.
The workarounds, mocs and reset selftests also do strange things. Those
ones did get updated to support GuC submission by not attempting manual
engine resets but using the GuC based hang detection instead. However,
it seems like they would also suffer the same deadlock scenario if the
GuC based reset were to fail.
I'm wondering if the better fix is to remove the use of the
I915_RESET_ENGINE flags completely when using GuC submission. So far as
I can tell, they are only used (outside of selftest hackery) in
intel_gt_handle_error to guard against multiple concurrent resets within
i915. Guarding against multiple concurrent resets in GuC is the GuC's
job. That leaves GuC based engine reset concurrent with i915 based full
GT reset. But that is fine because the full GT reset toasts all engines
AND the GuC. So it doesn't matter what GuC might or might not have been
doing at the time. The only other issue is multiple concurrent full GT
resets, but that is protected against by I915_RESET_BACKOFF.
So my thought is to add an 'if(!guc_submission)' wrapper around the set
and clear of the reset flags immediately before/after the call to
intel_gt_reset_global().
Fixing it there means the selftests can do what they like with the flags
without causing problems for GuC submission. It also means being one
step closer to removing the dodgy lockless locking completely, at least
when using GuC submission.
John.
Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 7e2d99dd012d..90a03c60c80c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -734,7 +734,8 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
reset_engine_count = i915_reset_engine_count(global, engine);
st_engine_heartbeat_disable(engine);
- set_bit(I915_RESET_ENGINE + id, >->reset.flags);
+ if (!using_guc)
+ set_bit(I915_RESET_ENGINE + id, >->reset.flags);
count = 0;
do {
struct i915_request *rq = NULL;
@@ -824,7 +825,8 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
if (err)
break;
} while (time_before(jiffies, end_time));
- clear_bit(I915_RESET_ENGINE + id, >->reset.flags);
+ if (!using_guc)
+ clear_bit(I915_RESET_ENGINE + id, >->reset.flags);
st_engine_heartbeat_enable(engine);
pr_info("%s: Completed %lu %s resets\n",
engine->name, count, active ? "active" : "idle");
@@ -1042,7 +1044,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
yield(); /* start all threads before we begin */
st_engine_heartbeat_disable_no_pm(engine);
- set_bit(I915_RESET_ENGINE + id, >->reset.flags);
+ if (!using_guc)
+ set_bit(I915_RESET_ENGINE + id, >->reset.flags);
do {
struct i915_request *rq = NULL;
struct intel_selftest_saved_policy saved;
@@ -1165,7 +1168,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
if (err)
break;
} while (time_before(jiffies, end_time));
- clear_bit(I915_RESET_ENGINE + id, >->reset.flags);
+ if (!using_guc)
+ clear_bit(I915_RESET_ENGINE + id, >->reset.flags);
st_engine_heartbeat_enable_no_pm(engine);
pr_info("i915_reset_engine(%s:%s): %lu resets\n",