Re: ✗ Fi.CI.BAT: failure for drm/i915: ttm for stolen (rev5)

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Fri, 1 Jul 2022 08:34:27 +0100

On 30/06/2022 15:20, Robert Beckett wrote:

On 29/06/2022 13:51, Robert Beckett wrote:

On 28/06/2022 17:22, Robert Beckett wrote:

On 28/06/2022 09:46, Tvrtko Ursulin wrote:

On 27/06/2022 18:08, Robert Beckett wrote:

On 22/06/2022 10:05, Tvrtko Ursulin wrote:

On 21/06/2022 20:11, Robert Beckett wrote:

On 21/06/2022 18:37, Patchwork wrote:
*Patch Details*
*Series:*    drm/i915: ttm for stolen (rev5)
*URL:*    https://patchwork.freedesktop.org/series/101396/ 
<https://patchwork.freedesktop.org/series/101396/>
*State:*    failure
*Details:* 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 

  CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5

    Summary

*FAILURE*

Serious unknown changes coming with Patchwork_101396v5 
absolutely need to be
verified manually.

If you think the reported changes have nothing to do with the 
changes
introduced in Patchwork_101396v5, please notify your bug team to 
allow them
to document this new failure mode, which will reduce false 
positives in CI.

External URL: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 

    Participating hosts (40 -> 41)

Additional (2): fi-icl-u2 bat-dg2-9
Missing (1): fi-bdw-samus

    Possible new issues

Here are the unknown changes that may have been introduced in 
Patchwork_101396v5:

      IGT changes

        Possible regressions

  * igt@i915_selftest@live@reset:
      o bat-adlp-4: PASS
<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@xxxxxxxxxx> 

        -> DMESG-FAIL
<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@xxxxxxxxxx> 

I keep hitting clobbered pages during engine resets on bat-adlp-4.
It seems to happen most of the time on that machine and 
occasionally on bat-adlp-6.

Should bat-adlp-4 be considered an unreliable machine like 
bat-adlp-6 is for now?

Alternatively, seeing the history of this in

commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: 
Exclude low pages (128KiB) of stolen from use"

could this be an indication that maybe the original issue is 
worse on adlp machines?
I have only ever seen page page 135 or 136 clobbered across many 
runs via trybot, so it looks fairly consistent.
Though excluding the use of over 540K of stolen might be too severe.

Don't know but I see that on the latest version you even hit pages 
165/166.

Any history of hitting this in CI without your series? If not, are 
there some other changes which could explain it? Are you touching 
the selftest itself?

Hexdump of the clobbered page looks quite complex. Especially 
POISON_FREE. Any idea how that ends up there?

(see 
https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@xxxxxxxxxx#dmesg-warnings702) 

after lots of slow debug via CI, it looks like the issue is that a 
ring buffer was allocated and taking up that page during the 
initial crc capture in the test, but by the time it came to check 
for corruption, it had been freed from that page.

The test has a number of weaknesses:

1. the busy check is done twice, without taking in to account any 
change in between. I assume previously this could be relied on 
never to occur, but now it can for some reason (more on that later)

You mean the stolen page used/unused test? Probably the premise is 
that the test controls the driver completely ie. is the sole user 
and the two checks are run at the time where nothing else could have 
changed the state.

With the nerfed request (as with GuC) this actually should hold. In 
the generic case I am less sure, my working knowledge faded a bit, 
but perhaps there was something guaranteeing the spinner couldn't 
have been retired yet at the time of the second check. Would need 
clarifying at least in comments.

2. the engine reset returns early with an error for guc submission 
engines, but it is silently ignored in the test. Perhaps it should 
ignore guc submission engines as it is a largely useless test for 
those situations.

Yes looks dodgy indeed. You will need to summon the owners of the 
GuC backend to comment on this.

However even if the test should be skipped with GuC it is extremely 
interesting that you are hitting this so I suspect there is a more 
serious issue at play.

indeed. That's why I am keen to get to the root cause instead of just 
slapping in a fix.

A quick obvious fix is to have a busy bitmask that remembers each 
page's busy state initially and only check for corruption if it was 
busy during both checks.

However, the main question is why this is occurring now with my 
changes.
I have added more debug to check where the stolen memory is being 
freed, but the first run last night didn't hit the issue for once.
I am running again now, will report back if I figure out where it 
is being freed.

I am pretty sure the "corruption" (which isn't actually corruption) 
is from a ring buffer.
The POISON_FREE is the only difference between the captured before 
and after dumps:

[0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
6b6b6b6b 6b6b6b6b

with the 2nd dword being the MI_ARB_CHECK used for the spinner.
I think this is the request poisoning from i915_request_retire()

The bit I don't know yet is why a ring buffer was freed between the 
initial crc capture and the corruption check. The spinner should be 
active across the entire test, maintaining a ref on the context and 
it's ring.

hopefully my latest debug will give more answers.

Yeah if you can figure our whether the a) spinner is still active 
during the 2nd check (as I think it should be), and b) is the 
corruption detected in the same pages which were used in the 1st 
pass that would be interesting.

yep. The latest run is still stuck in the CI queue after 27 hours.
I think I have enough debug in there to catch it now.
Hopefully I can get a root cause once it gets chance to run.

https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@xxxxxxxxxx#dmesg-warnings496 

well, the run finally happened.
And it shows that the freed resource happens from a workqueue. Not 
helpful.

I'll now add a saved stack traces to all objects that saves where it 
is allocated and freed/queued for free.

https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v8/fi-rkl-guc/igt@i915_selftest@live@xxxxxxxxxx#dmesg-warnings419 

I'm pretty sure I know what is going on now.

igt_reset_engines_stolen() loops around each engine and calls 
__igt_reset_stolen() for that engine.

__igt_reset_stolen() does
intel_context_create()

igt_spinner_create_request()->intel_context_create_request()->__i915_request_create()->intel_context_get() 

intel_context_put()

leaving the request as the remaining holder of the context.

it then does the reset, which does nothing on GuC setups, does the 
comparisons, then ends the spinner via 
igt_spinner_fini()->igt_spinner_end()
which lets the spinner request finish.

once the request is retired, intel_context_put() is eventually called, 
which starts the GuC teardown of the context as the request was the last 
holder of the context.

This GuC teardown is asynchronous via ct transactions.
By the time the ct_process_request() sees the 
INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE message, the test has already 
looped to the next engine and has already checked the original status of 
the page that the destroying context used for its ring buffer, so the 
test sees it being freed from the previous loop while testing the next 
engine. It considers this a corruption of the stolen memory due to the 
previously highlighted double checking of busy state for each page.

Alright, makes sense. Test kind of depends on implementation details and 
perhaps ideally it should do some explicit flushing before moving to the 
next engine, instead of assuming engine reset leaves everything idle and 
flushed. I don't know from the top of my head what kind of flushing 
would that be. In theory all possible delayed workers that we have. 
Maybe. But never mind for now.

I think for now, we should simply not test GuC submission engines in 
line with the reset call returning an error.
If at some point we want to enable this test for GuC setups, then 
flushing and waiting for context cleanup would need to be added to the 
test.

Yeah that's okay. Alternative could be to provoke for instance the 
preempt timeout and hit the reset in that way. I am pretty sure some 
tests fiddle with it to enable rapid execution time. (Remove 
MI_ARB_CEHCK from the spinner and send an enging pulse, then wait for 
reset.)

Anyone know why per engine reset is not allowed for GuC submission setup?
looking at commit "eb5e7da736f3 drm/i915/guc: Reset implementation for 
new GuC interface" doesn't really detail why per engine resets are not 
allowed.
Maybe it just never got implemented? or are there reasons to not allow 
the host to request specific engine resets?

As Thomas has said - thinking seems to be there must be no explicit 
external mechanism to make GuC trigger the engine reset. Maybe it would 
be useful for testing purposes, or maybe indirect route as above is enough.

Regards,

Tvrtko