Re: [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/11/2021 16:47, Matthew Brost wrote:
The hangcheck selftest blocks per engine resets by setting magic bits in
the reset flags. This is incorrect for GuC submission because if the GuC
fails to reset an engine we would like to do a full GT reset. Do no set
these magic bits when using GuC submission.

Side note this lockless algorithm with magic bits to block resets really
should be ripped out.
As a first step, I am seeing a pointless BUILD_BUG but no BUILD_BUG at all for what really needs to be verified. Specifically, in intel_gt_handle_error, inside the engine reset loop, there is:                         BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);

Given that the above two values are explicit #defines of '1' and '2'. I'm not seeing any value to this assert. On the other hand, what I am not seeing anywhere is an assert that 'I915_RESET_ENGINE + max_engine_id < I915_WEDGED_ON_INIT'. That being the thing that would actually go horribly wrong if the engine count increased too far. Seems like there should be one of those in intel_engines_init_mmio, using ARRAY_SIZE(intel_engines) as the max id.


It looks like 'busy-reset' and 'reset-idle' parts of 'igt_ctx_sseu' in gem/selftests/i915_gem_context.c also mess around with these flags and then try to issue a manual engine reset. Presumably those tests are also going to have issues with GuC submission.

The workarounds, mocs and reset selftests also do strange things. Those ones did get updated to support GuC submission by not attempting manual engine resets but using the GuC based hang detection instead. However, it seems like they would also suffer the same deadlock scenario if the GuC based reset were to fail.

I'm wondering if the better fix is to remove the use of the I915_RESET_ENGINE flags completely when using GuC submission. So far as I can tell, they are only used (outside of selftest hackery) in intel_gt_handle_error to guard against multiple concurrent resets within i915. Guarding against multiple concurrent resets in GuC is the GuC's job. That leaves GuC based engine reset concurrent with i915 based full GT reset. But that is fine because the full GT reset toasts all engines AND the GuC. So it doesn't matter what GuC might or might not have been doing at the time. The only other issue is multiple concurrent full GT resets, but that is protected against by I915_RESET_BACKOFF.

So my thought is to add an 'if(!guc_submission)' wrapper around the set and clear of the reset flags immediately before/after the call to intel_gt_reset_global().

Fixing it there means the selftests can do what they like with the flags without causing problems for GuC submission. It also means being one step closer to removing the dodgy lockless locking completely, at least when using GuC submission.

John.



Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
---
  drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 12 ++++++++----
  1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 7e2d99dd012d..90a03c60c80c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -734,7 +734,8 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
  		reset_engine_count = i915_reset_engine_count(global, engine);
st_engine_heartbeat_disable(engine);
-		set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
+		if (!using_guc)
+			set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
  		count = 0;
  		do {
  			struct i915_request *rq = NULL;
@@ -824,7 +825,8 @@ static int __igt_reset_engine(struct intel_gt *gt, bool active)
  			if (err)
  				break;
  		} while (time_before(jiffies, end_time));
-		clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
+		if (!using_guc)
+			clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
  		st_engine_heartbeat_enable(engine);
  		pr_info("%s: Completed %lu %s resets\n",
  			engine->name, count, active ? "active" : "idle");
@@ -1042,7 +1044,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
  		yield(); /* start all threads before we begin */
st_engine_heartbeat_disable_no_pm(engine);
-		set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
+		if (!using_guc)
+			set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
  		do {
  			struct i915_request *rq = NULL;
  			struct intel_selftest_saved_policy saved;
@@ -1165,7 +1168,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
  			if (err)
  				break;
  		} while (time_before(jiffies, end_time));
-		clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
+		if (!using_guc)
+			clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
  		st_engine_heartbeat_enable_no_pm(engine);
pr_info("i915_reset_engine(%s:%s): %lu resets\n",




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux