Re: [PATCH] drm/i915/guc: do not capture error state on exiting context

Andrzej Hajda <andrzej.hajda@xxxxxxxxx> · Thu, 29 Sep 2022 11:49:46 +0200

On 29.09.2022 10:22, Tvrtko Ursulin wrote:

On 28/09/2022 19:27, John Harrison wrote:
On 9/28/2022 00:19, Tvrtko Ursulin wrote:
On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
On 27/09/2022 07:49, Andrzej Hajda wrote:
On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
On 9/26/2022 3:44 PM, Andi Shyti wrote:
Hi Andrzej,

On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
Capturing error state is time consuming (up to 350ms on DG2), 
so it should
be avoided if possible. Context reset triggered by context 
removal is a
good example.
With this patch multiple igt tests will not timeout and should 
run faster.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
Signed-off-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
fine for me:

Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>

Just to be on the safe side, can we also have the ack from any of
the GuC folks? Daniele, John?

Andi


---
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b01..cb58029208afe1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4425,7 +4425,8 @@ static void 
guc_handle_context_reset(struct intel_guc *guc,
      trace_intel_context_reset(ce);
        if (likely(!intel_context_is_banned(ce))) {
-        capture_error_state(guc, ce);
+        if (!intel_context_is_exiting(ce))
+            capture_error_state(guc, ce);

I am not sure here - if we have a persistent context which caused a 
GPU hang I'd expect we'd still want error capture.

What causes the reset in the affected IGTs? Always preemption timeout?

guc_context_replay(ce);

You definitely don't want to replay requests of a context that is 
going away.

My intention was to just avoid error capture, but that's even 
better, only condition change:
-        if (likely(!intel_context_is_banned(ce))) {
+       if (likely(intel_context_is_schedulable(ce)))  {

Yes that helper was intended to be used for contexts which should 
not be scheduled post exit or ban.

Daniele - you say there are some misses in the GuC backend. Should 
most, or even all in intel_guc_submission.c be converted to use 
intel_context_is_schedulable? My idea indeed was that "ban" should 
be a level up from the backends. Backend should only distinguish 
between "should I run this or not", and not the reason.

I think that all of them should be updated, but I'd like Matt B to 
confirm as he's more familiar with the code than me.

Right, that sounds plausible to me as well.

One thing I forgot to mention - the only place where backend can care 
between "schedulable" and "banned" is when it picks the preempt 
timeout for non-schedulable contexts. This is to only apply the 
strict 1ms to banned (so bad or naught contexts), while the ones 
which are exiting cleanly get the full preempt timeout as otherwise 
configured. This solves the ugly user experience quirk where GPU 
resets/errors were logged upon exit/Ctrl-C of a well behaving 
application (using non-persistent contexts). Hopefully GuC can match 
that behaviour so customers stay happy.

Regards,

Tvrtko

The whole revoke vs ban thing seems broken to me.

First of all, if the user hits Ctrl+C we need to kill the context off 
immediately. That is a fundamental customer requirement. Render and 
compute engines have a 7.5s pre-emption timeout. The user should not 
have to wait 7.5s for a context to be removed from the system when 
they have explicitly killed it themselves. Even the regular timeout of 
640ms is borderline a long time to wait. And note that there is an 
ongoing request/requirement to increase that to 1900ms.

Under what circumstances would a user expect anything sensible to 
happen after a Ctrl+C in terms of things finishing their rendering and 
display nice pretty images? They killed the app. They want it dead. We 
should be getting it off the hardware as quickly as possible. If you 
are really concerned about resets causing collateral damage then maybe 
bump the termination timeout from 1ms up to 10ms, maybe at most 100ms. 
If an app is 'well behaved' then it should cleanly exit within 10ms. 
But if it is bad (which is almost certainly the case if the user is 
manually and explicitly killing it) then it needs to be killed because 
it is not going to gracefully exit.

Right.. I had it like that initially (lower timeout - I think 20ms or 
so, patch history on the mailing list would know for sure), but then 
simplified it after review feedback to avoid adding another timeout value.

So it's not at all about any expectation that something should actually 
finish to any sort of completion/success. It is primarily about not 
logging an error message when there is no error. Thing to keep in mind 
is that error messages are a big deal in some cultures. In addition to 
that, avoiding needless engine resets is a good thing as well.

Previously the execlists backend was over eager and only allowed for 1ms 
for such contexts to exit. If the context was banned sure - that means 
it was a bad context which was causing many hangs already. But if the 
context was a clean one I argue there is no point in doing an engine reset.

So if you want, I think it is okay to re-introduce a secondary timeout.

Or if you have an idea on how to avoid the error messages / GPU resets 
when "friendly" contexts exit in some other way, that is also something 
to discuss.

Secondly, the whole persistence thing is a total mess, completely 
broken and intended to be massively simplified. See the internal task 
for it. In short, the plan is that all contexts will be immediately 
killed when the last DRM file handle is closed. Persistence is only 
valid between the time the per context file handle is closed and the 
time the master DRM handle is closed. Whereas, non-persistent contexts 
get killed as soon as the per context handle is closed. There is 
absolutely no connection to heartbeats or other irrelevant operations.

The change we are discussing is not about persistence, but for the 
persistence itself - I am not sure it is completely broken and if, or 
when, the internal task will result with anything being attempted. In 
the meantime we had unhappy customers for more than a year. So do we 
tell them "please wait for a few years more until some internal task 
with no clear timeline or anyone assigned maybe gets looked at"?

So in my view, the best option is to revert the ban vs revoke patch. 
It is creating bugs. It is making persistence more complex not 
simpler. It harms the user experience.

I am not aware of the bugs, even less so that it is harming user 
experience!?

Bugs are limited to the GuC backend or in general? My CI runs were clean 
so maybe test cases are lacking. Is it just a case of 
s/intel_context_is_banned/intel_context_is_schedulable/ in there to fix it?

Again, the change was not about persistence. It is the opposite - 
allowing non-persistent contexts to exit cleanly.

If the original problem was simply that error captures were being done 
on Ctrl+C then the fix is simple. Don't capture for a banned context. 
There is no need for all the rest of the revoke patch.

Error capture was not part of the original story so it may be a 
completely orthogonal topic that we are discussing it in this thread.

Wouldn't be good then to separate these two issues: 
banned/exiting/schedulable handling and error capturing of exiting context.
This patch handles only the latter, and as I understand there is no big 
controversy that we de not need capture errors for exiting contexts.
If yes, can we ack/merge this patch, to make CI happy and continue 
discussion on the former.

Regards
Andrzej



Regards,

Tvrtko