Re: [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Wed, 27 Oct 2021 22:47:20 +0200

On 10/27/21 22:34, John Harrison wrote:
On 10/26/2021 23:36, Thomas Hellström wrote:
Hi, John,

On 10/26/21 21:55, John Harrison wrote:
On 10/21/2021 23:23, Thomas Hellström wrote:
On 10/21/21 22:37, Matthew Brost wrote:
On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
Hi, Matthew,

On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
The hangcheck selftest blocks per engine resets by setting magic 
bits
in
the reset flags. This is incorrect for GuC submission because if 
the
GuC
fails to reset an engine we would like to do a full GT reset. Do no
set
these magic bits when using GuC submission.

Side note this lockless algorithm with magic bits to block resets
really
should be ripped out.

Lockless algorithm aside, from a quick look at the code in
intel_reset.c it appears to me like the interface that falls back 
to a
full GT reset is intel_gt_handle_error() whereas 
intel_engine_reset()
is explicitly intended to not do that, so is there a discrepancy
between GuC and non-GuC here?

With GuC submission when an engine reset fails, we get an engine 
reset
failure notification which triggers a full GT reset
(intel_guc_engine_failure_process_msg in intel_guc_submission.c). 
That
reset is blocking by setting these magic bits. Clearing the bits 
in this
function doesn't seem to unblock that reset either, the driver 
tries to
unload with a worker blocked, and results in the blow up. 
Something with
this lockless algorithm could be wrong as clear of the bit should
unlblock the reset but it is doesn't. We can look into that but in 
the
meantime we need to fix this test to be able to fail gracefully 
and not
crash CI.

Yeah, for that lockless algorithm if needed, we might want to use a 
ww_mutex per engine or something,
but point was that AFAICT at least one of the tests that set those 
flags explicitly tested the functionality that no other engines 
than the intended one was reset when the intel_engine_reset() 
function was used, and then if GuC submission doesn't honor that, 
wouldn't a better approach be to make a code comment around 
intel_engine_reset() to explain the differences and disable that 
particular test for GuC?. Also wouldn't we for example we see a 
duplicated full GT reset with GuC if intel_engine_reset() fails as 
part of the intel_gt_handle_error() function?
Re-reading this thread, I think there is a misunderstanding.

The selftests themselves have already been updated to support GuC 
based engine resets. That is done by submitting a hanging context 
and letting the GuC detect the hang and issue a reset. There is no 
mechanism available for i915 to directly issue or request an engine 
based reset (because i915 does not know what is running on any given 
engine at any given time, being disconnected from the scheduler).

So the tests are already correctly testing per engine resets and do 
not go anywhere near either intel_engine_reset() or 
intel_gt_handle_error() when GuC submission is used. The problem is 
what happens if the engine reset fails (which supposedly can only 
happen with broken hardware). In that scenario, there is an 
asynchronous message from GuC to i915 to notify us of the failure. 
The KMD receives that notification and then (eventually) calls 
intel_gt_handle_error() to issue a full GT reset. However, that is 
blocked because the selftest is not expecting it and has vetoed the 
possibility.

This is where my understanding of the discussion differs. According 
to Matthew, the selftest actually proceeds to clear the bits, but the 
worker that calls into intel_gt_handle_error() never wakes up. (and 
that's probably due to clear_bit() being used instead of 
clear_and_wake_up_bit()).
Hmm, missed that point. Yeah, sounds like the missing wake_up suffix 
is what is causing the deadlock. I can't see any other reason why the 
reset handler would not proceed once the flags are cleared. And it 
looks like the selftest should timeout out waiting for the request and 
continue on to clear the bits just fine.

And my problem with this particular patch is that it adds even more 
"if (!guc_submission)" which is already sprinkled all over the place 
in the selftests to the point that it becomes difficult to see what 
(if anything) the tests are really testing.
I agree with this. Fixing the problem at source seems like a better 
solution than hacking lots of different bits in different tests.

OK, so if we can fix this in intel_gt_handle_error() that'd be great.

Thanks,

Thomas