On 7/19/2022 02:42, Tvrtko Ursulin wrote:
On 19/07/2022 01:05, John Harrison wrote:
On 7/18/2022 05:15, Tvrtko Ursulin wrote:
On 13/07/2022 00:31, John.C.Harrison@xxxxxxxxx wrote:
From: Matthew Brost <matthew.brost@xxxxxxxxx>
Remove bogus GEM_BUG_ON which compared kernel context timeline seqno to
seqno in memory on engine PM unpark. If a GT reset occurred these values
might not match as a kernel context could be skipped. This bug was
hidden by always switching to a kernel context on park (execlists
requirement).
Reset of the kernel context? Under which circumstances does that happen?
As per description, the issue is with full GT reset.
It is unclear whether the claim is that this is a general problem or that
the assert is only invalid with the GuC. Doesn't the lack of a CI-reported
issue suggest it is not a generic problem?
Currently it is not an issue because we always switch to the kernel
context, since that's how execlists works and the entire driver is
fundamentally based on execlist operation. Once we stop using the
kernel context as a (non-functional) barrier with GuC submission, you
would see an issue without this fix.
Is the issue with GuC, with GuC plus full reset, or with full reset
regardless of the backend?
The issue is with code making invalid assumptions. The assumption is
currently not failing because the execlist backend requires the use of a
barrier context for a number of operations. The GuC backend does not
require this. In fact, the barrier context does not function as a
barrier when the scheduler is external to i915. Hence the desire to
remove the use of the barrier context from generic i915 operation and
use it only in execlist mode. At that point, the invalid assumption will
no longer hold and the BUG will fire.
If the issue is only with GuC, the patch should have a drm/i915/guc
prefix at a minimum. But if it only becomes a problem once the GuC
backend stops parking with the kernel context, then I think the whole
unpark code should be refactored more cleanly than just removing the one
assert. Otherwise, what is the point of leaving everything else in there?
Or, if the issue is backend-agnostic, i.e. *if* a full reset happens to
hit during parking, then it is different. Wouldn't that be a race between
parking and reset, which probably shouldn't happen to begin with?
The issue is neither with GuC nor with resets, GT or otherwise. The
issue is with generic i915 code making assumptions about backend
implementations that are only correct for the execlist implementation.
John.
Regards,
Tvrtko
Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index b0a4a2dbe3ee9..fb3e1599d04ec 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
 			 ce->timeline->seqno,
 			 READ_ONCE(*ce->timeline->hwsp_seqno),
 			 ce->ring->emit);
-		GEM_BUG_ON(ce->timeline->seqno !=
-			   READ_ONCE(*ce->timeline->hwsp_seqno));
 	}
 	if (engine->unpark)