Re: [PATCH 3/4] drm/i915/gt: Perform an arbitration check before busywaiting

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Mon, 11 Jan 2021 17:12:57 +0000

On 11/01/2021 16:27, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2021-01-11 16:19:40)

On 11/01/2021 10:57, Chris Wilson wrote:
During igt_reset_nop_engine, it was observed that an unexpected failed
engine reset led to us busywaiting on the stop-ring semaphore (set
during the reset preparations) on the first request afterwards. There was
no explicit MI_ARB_CHECK in this sequence as the presumption was that
the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
It did not in this circumstance, so force it.

In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
remember if I knew that or not..

MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
miss.

1)
Why not the same handling in !gen12 version?

Because I think it's a bug in tgl [a0 at least]. I think I've seen the
same symptoms on tgl before, but not earlier. This is the first time the
sequence clicked as to why it was busy spinning. Random engine reset
failures are rare enough -- I was meant to also write a test case to
inject failure.

Random engine reset failure you think is a TGL issue?

2)
Failed reset leads to busy-hang in following request _tail_? But there
is an arb check at the start of following request as well. Or in cases
where we context switch into the middle of a previously executing request?

It was the first request submitted after the failed reset. We expect to
clear the ring-stop flag on the CS IDLE->ACTIVE event.

But why would that busy hang? Hasn't the failed request unpaused the ring?

The engine was idle at the time of the failed reset. We left the
ring-stop set, and submitted the next batch of requests. We hit the
MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
without hitting an arbitration point (first request, no init-breadcrumb
in this case), the semaphore was stuck.

So a kernel context request? Why hasn't IDLE->ACTIVE cleared ring stop? 
Presumably this CSB must come after the first request has been submitted 
so apparently I am still not getting how it hangs.

Just because igt_reset_nop_engine does things "quickly"? It prevents the
CSB from arriving? So ARB_CHECK picks up on the fact ELSP has been
rewritten before the IDLE->ACTIVE event is received and/or engine reset
prevented it from arriving?

Regards,

Tvrtko