On 11/01/2021 21:54, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2021-01-11 17:12:57)
On 11/01/2021 16:27, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
On 11/01/2021 10:57, Chris Wilson wrote:
During igt_reset_nop_engine, it was observed that an unexpected failed
engine reset led to us busywaiting on the stop-ring semaphore (set
during the reset preparations) on the first request afterwards. There was
no explicit MI_ARB_CHECK in this sequence as the presumption was that
the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
It did not in this circumstance, so force it.
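For reference, a rough sketch of the shape of the change being discussed: appending an explicit MI_ARB_CHECK after the stop-ring MI_SEMAPHORE_WAIT | POLL in the request tail, rather than relying on the semaphore miss itself to act as an arbitration point. This is an illustrative userspace mock, not the actual i915 code (the real gen12 path lives in the fini-breadcrumb emission and uses the HWSP semaphore address); the opcode encodings follow the style of intel_gpu_commands.h but treat the exact values and the helper name as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Encodings in the style of i915's intel_gpu_commands.h (illustrative) */
#define MI_INSTR(opcode, flags) (((uint32_t)(opcode) << 23) | (flags))
#define MI_NOOP                 MI_INSTR(0x00, 0)
#define MI_ARB_CHECK            MI_INSTR(0x05, 0)
#define MI_SEMAPHORE_WAIT       MI_INSTR(0x1c, 2) /* 4-dword command */
#define MI_SEMAPHORE_POLL       (1u << 15)
#define MI_SEMAPHORE_SAD_EQ_SDD (4u << 12)

/*
 * Hypothetical helper mirroring the busywait emission: poll until the
 * stop-ring flag in the hardware status page reads back as zero.
 */
static uint32_t *emit_stop_busywait(uint32_t *cs, uint32_t hws_addr)
{
	*cs++ = MI_SEMAPHORE_WAIT | MI_SEMAPHORE_POLL | MI_SEMAPHORE_SAD_EQ_SDD;
	*cs++ = 0;        /* semaphore data: wait for *hws_addr == 0 */
	*cs++ = hws_addr; /* address of the stop-ring flag */
	*cs++ = 0;
	*cs++ = MI_NOOP;
	/*
	 * The fix under review: a missed MI_SEMAPHORE_WAIT | POLL was
	 * presumed to be an arbitration point, but was observed not to be
	 * in this circumstance, so force one explicitly.
	 */
	*cs++ = MI_ARB_CHECK;
	return cs;
}
```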
In other words, MI_SEMAPHORE_POLL is not a preemption point? Can't
remember if I knew that or not...
MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
miss.
1)
Why not the same handling in !gen12 version?
Because I think it's a bug in tgl [a0 at least]. I think I've seen the
same symptoms on tgl before, but not earlier. This is the first time the
sequence clicked as to why it was busy spinning. Random engine reset
failures are rare enough -- I was meant to also write a test case to
inject failure.
Random engine reset failure you think is a TGL issue?
The MI_SEMAPHORE_WAIT | POLL miss not generating an arbitration point.
We have quite a few selftests and IGT that use this feature.
So I was wondering if this was similar to one of those tgl issues with
semaphores and CS events.
The random engine reset failure here is also decidedly odd. The engine
was idle!
2)
Failed reset leads to busy-hang in following request _tail_? But there
is an arb check at the start of following request as well. Or in cases
where we context switch into the middle of a previously executing request?
It was the first request submitted after the failed reset. We expect to
clear the ring-stop flag on the CS IDLE->ACTIVE event.
But why would that busy hang? Hasn't the failed reset unpaused the ring?
The engine was idle at the time of the failed reset. We left the
ring-stop set, and submitted the next batch of requests. We hit the
MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
without hitting an arbitration point (first request, no init-breadcrumb
in this case), the semaphore was stuck.
So a kernel context request?
Ish. The selftest is using empty requests, and not emitting the
initial breadcrumb. (So acting like a kernel context.)
Why hasn't IDLE->ACTIVE cleared ring stop?
There hasn't been an idle->active event, not a single CS event after
writing to ELSP and timing out while still spinning on the semaphore.
Presumably this CSB must come after the first request has been submitted,
so apparently I am still not getting how it hangs.
It was never sent. The context is still in pending[0] (not active[0])
and there's no sign in the trace of any interrupts/tasklet handling other
than the semaphore-wait interrupt.
Just because igt_reset_nop_engine does things "quickly"? It prevents the
CSB from arriving?
More that since we do very little, we hit the semaphore before the CS
has recovered from the shock of being asked to do something.
So ARB_CHECK picks up on the fact that ELSP has been
rewritten before the IDLE->ACTIVE event is received, and/or the engine
reset prevented it from arriving?
The ARB_CHECK should trigger the CS to generate the IDLE->ACTIVE event.
(Of course assuming that the bug is in the semaphore not triggering the
event due to strange circumstances and not a bug in the event generator
itself.) I'm suspicious of the semaphore due to the earlier CS bugs with
lite-restores + semaphores, and am expecting that since the MI_ARB_CHECK
is explicit, it actually works.
Okay, got it, thanks. I suggest it would be good to slightly improve the
commit message so it is clear what the suspected TGL quirks are. But in
general:
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx