On 9/23/21 4:33 PM, Tvrtko Ursulin wrote:
On 23/09/2021 14:19, Thomas Hellström wrote:
On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
On 23/09/2021 12:47, Thomas Hellström wrote:
Hi, Tvrtko,
On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
On 22/09/2021 07:25, Thomas Hellström wrote:
With GuC submission on DG1, the execution of the requests times out
for the gem_exec_suspend igt test case after executing around 800-900
of 1000 submitted requests.

Given the time we allow elsewhere for fences to signal (in the order
of seconds), increase the timeout before we mark the gt wedged and
proceed.
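For reference, the timeout in question is the one used by the
suspend-path idle wait before wedging; roughly like the below,
paraphrased from gt/intel_gt_pm.c (exact names and values may differ
between trees):

static void wait_for_suspend(struct intel_gt *gt)
{
	if (!intel_gt_pm_is_awake(gt))
		return;

	/* Wait for in-flight requests, then for the GuC, to go idle. */
	if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
		/* Timed out: forcibly cancel outstanding work and wedge. */
		intel_gt_set_wedged(gt);
		intel_gt_retire_requests(gt);
	}

	intel_gt_pm_wait_for_idle(gt);
}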
I suspect it is not about requests not retiring in time but about
the intel_guc_wait_for_idle part of intel_gt_wait_for_idle.
Although I don't know which G2H message the code is waiting for at
suspend time, so perhaps that's something to run past the GuC experts.
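Schematically it is something like the below; this is only a sketch
of the split I mean, not the real implementation (the real code in
gt/intel_gt_requests.c tracks the remaining budget differently), but
the point is that one timeout covers both waits:

long intel_gt_wait_for_idle(struct intel_gt *gt, long timeout)
{
	/* 1) wait for submitted requests to retire, bounded by the timeout */
	timeout = intel_gt_retire_requests_timeout(gt, timeout);
	if (timeout < 0)
		return timeout;

	/* 2) wait for the GuC, i.e. outstanding G2H replies, with what's left */
	return intel_uc_wait_for_idle(&gt->uc, timeout);
}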
So what's happening here is that the test submits 1000 requests,
each writing a value to an object, and then that object content is
checked after resume. With GuC it turns out that only 800-900 or so
values are actually written before we time out, and the test
(basic-S3) fails, but not on every run.
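The shape of the subtest is roughly the below (not the actual IGT
source; submit_store_dword() is a made-up helper standing in for the
test's store-dword batches):

	for (i = 0; i < 1000; i++)
		submit_store_dword(fd, obj, 4 * i, i); /* write value i at dword i */

	igt_system_suspend_autoresume(SUSPEND_STATE_MEM, SUSPEND_TEST_NONE);

	map = gem_mmap__device_coherent(fd, obj, 0, 4096, PROT_READ);
	for (i = 0; i < 1000; i++)
		igt_assert_eq(map[i], i); /* with GuC, only ~800-900 of these made it */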
Yes, and that did not make sense to me. It is a single context even,
so I could not come up with an explanation for why GuC would be slower.
Unless it somehow manages to not even update the ring tail in time
and requests are still only stuck in the software queue? Perhaps you
can see that from context tail and head when it happens.
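	/*
	 * Hypothetical debug aid at the timeout site (how to get at the
	 * stuck context, ce, is left out): dump its ring state to see
	 * whether the tail was ever advanced past the stuck requests.
	 */
	drm_info(&gt->i915->drm, "ring head 0x%08x tail 0x%08x emit 0x%08x\n",
		 ce->ring->head, ce->ring->tail, ce->ring->emit);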
This is a bit interesting in itself, because I never saw the
hang-S3 test fail, which, from what I can tell, is basically an
identical test but with a spinner submitted after the 1000th
request. Could be that the suspend backup code ends up waiting for
something before we end up in intel_gt_wait_for_idle, giving more
requests time to execute.
No idea, I don't know the suspend paths that well. For instance,
before looking at the code I thought we would preempt what's
executing and not wait for everything that has been submitted to
finish. :)
Anyway, if that turns out to be correct then perhaps it would be
better to split the two timeouts (since the required GuC timeout is
perhaps fundamentally independent) so it's clear who needs how
much time. Adding Matt and John to comment.
You mean we have separate timeouts depending on whether we're using
GuC or execlists submission?
No, I don't know yet. First I think we need to figure out what
exactly is happening.
Well then TBH I will need to file a separate Jira about that. There
might be various things going on here, like switching between the
migrate context for eviction of unrelated LMEM buffers and the
context used by gem_exec_suspend. The gem_exec_suspend failures are
blocking DG1 BAT so it's pretty urgent to get this series merged. If
you insist I can leave this patch out for now, but I'd rather commit
it as-is and file a Jira instead.
I see now how you have i915_gem_suspend() in between two
lmem_suspend() calls in this series. So the first call has the potential
of creating a lot of requests, and you think that interferes? Sounds
plausible, but it implies GuC timeslicing is less efficient, if I follow?
Yes, I guess so. I'm not sure exactly what is not performing so well
with the GuC, but some tests really take a big performance hit, like
gem_lmem_swapping and gem_exec_whisper; those may trigger entirely
different situations than what we have here, though.
IMO it is okay to leave for follow-up work, but strictly speaking,
unless I am missing something, the approach of bumping the timeout
does not sound valid if the copying is done async.
Not async ATM. In any case it will probably make sense to sync before we
start the GT timeout, so that remaining work can be done undisturbed by
the copying. That way the copying will always succeed, but depending on how
much and what type of work user-space has queued up, that work might be terminated.
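Something along these lines, with hypothetical names (the wrapper and
lmem_suspend_wait() are made up here), just to show the ordering I mean:

static int lmem_backup_then_suspend(struct drm_i915_private *i915)
{
	int ret;

	/* Queue the LMEM backup blits on the migrate context. */
	ret = lmem_suspend(i915);
	if (ret)
		return ret;

	/* Made-up helper: explicitly wait for those blits to complete... */
	ret = lmem_suspend_wait(i915);
	if (ret)
		return ret;

	/* ...so the timeout-bounded idle wait only covers user submissions. */
	i915_gem_suspend(i915);
	return 0;
}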
Because the timeout is then mandated not only as a function of GPU
activity (let's say user controlled), but also of the amount of
unpinned/idle buffers which happen to be lying around (which is more
i915 controlled, or mixed at least).
So the question is: with enough data to copy, any timeout could be too
low, and then how long do we want to wait before failing suspend? Whether
this is an argument for a separate timeout specifically addressing the
suspend path, I am not sure. Perhaps there is no choice but to simply
wait until buffers are swapped out, otherwise nothing will work.
Regards,
Tvrtko
Thanks,
Thomas.