On 9/29/2022 3:40 AM, Tvrtko Ursulin wrote:
On 29/09/2022 10:49, Andrzej Hajda wrote:
On 29.09.2022 10:22, Tvrtko Ursulin wrote:
On 28/09/2022 19:27, John Harrison wrote:
On 9/28/2022 00:19, Tvrtko Ursulin wrote:
On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
On 27/09/2022 07:49, Andrzej Hajda wrote:
On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
On 9/26/2022 3:44 PM, Andi Shyti wrote:
Hi Andrzej,
On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
Capturing error state is time consuming (up to 350ms on DG2), so it
should be avoided if possible. Context reset triggered by context
removal is a good example.
With this patch multiple igt tests will not time out and should run
faster.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
Signed-off-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
fine for me:
Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Just to be on the safe side, can we also have the ack from any of
the GuC folks? Daniele, John?
Andi
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b01..cb58029208afe1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 	trace_intel_context_reset(ce);
 	if (likely(!intel_context_is_banned(ce))) {
-		capture_error_state(guc, ce);
+		if (!intel_context_is_exiting(ce))
+			capture_error_state(guc, ce);
I am not sure here - if we have a persistent context which caused a
GPU hang, I'd expect we'd still want error capture.
What causes the reset in the affected IGTs? Always preemption timeout?
 		guc_context_replay(ce);
You definitely don't want to replay requests of a context that
is going away.
My intention was to just avoid error capture, but that's even better -
only the condition changes:
-	if (likely(!intel_context_is_banned(ce))) {
+	if (likely(intel_context_is_schedulable(ce))) {
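(For reference, the surrounding block in guc_handle_context_reset would
then read roughly as below - a sketch of the proposal, not the final
patch:)

	if (likely(intel_context_is_schedulable(ce))) {
		/* Only capture and replay for contexts we still intend to run */
		capture_error_state(guc, ce);
		guc_context_replay(ce);
	}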
Yes that helper was intended to be used for contexts which
should not be scheduled post exit or ban.
Daniele - you say there are some misses in the GuC backend. Should
most, or even all, of the checks in intel_guc_submission.c be
converted to use intel_context_is_schedulable? My idea indeed was that
"ban" should be a level up from the backends. A backend should only
distinguish between "should I run this or not", and not care about the
reason.
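(As a rough sketch of what that helper folds together - from memory,
so the exact flag names may differ:)

	static inline bool intel_context_is_schedulable(const struct intel_context *ce)
	{
		/* Neither cleanly exiting nor banned - still allowed to run */
		return !test_bit(CONTEXT_EXITING, &ce->flags) &&
		       !test_bit(CONTEXT_BANNED, &ce->flags);
	}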
I think that all of them should be updated, but I'd like Matt B
to confirm as he's more familiar with the code than me.
Right, that sounds plausible to me as well.
One thing I forgot to mention - the only place where the backend
should care about the difference between "schedulable" and "banned" is
when it picks the preempt timeout for non-schedulable contexts. This
is to only apply the strict 1ms to banned (so bad or naughty)
contexts, while the ones which are exiting cleanly get the full
preempt timeout as otherwise configured. This solves the ugly user
experience quirk where GPU resets/errors were logged upon exit/Ctrl-C
of a well-behaved application (using non-persistent contexts).
Hopefully GuC can match that behaviour so customers stay happy.
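(In the execlists backend that choice looks roughly like the sketch
below - paraphrased, not verbatim:)

	/* Banned contexts get the strict 1ms so they are reset quickly,
	 * while cleanly exiting ones keep whatever timeout is configured.
	 */
	if (unlikely(intel_context_is_banned(rq->context)))
		return 1; /* ms */

	return READ_ONCE(engine->props.preempt_timeout_ms);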
Regards,
Tvrtko
The whole revoke vs ban thing seems broken to me.
First of all, if the user hits Ctrl+C we need to kill the context
off immediately. That is a fundamental customer requirement. Render
and compute engines have a 7.5s pre-emption timeout. The user
should not have to wait 7.5s for a context to be removed from the
system when they have explicitly killed it themselves. Even the
regular timeout of 640ms is borderline a long time to wait. And
note that there is an ongoing request/requirement to increase that
to 1900ms.
Under what circumstances would a user expect anything sensible to
happen after a Ctrl+C in terms of things finishing their rendering
and display nice pretty images? They killed the app. They want it
dead. We should be getting it off the hardware as quickly as
possible. If you are really concerned about resets causing
collateral damage then maybe bump the termination timeout from 1ms
up to 10ms, maybe at most 100ms. If an app is 'well behaved' then
it should cleanly exit within 10ms. But if it is bad (which is
almost certainly the case if the user is manually and explicitly
killing it) then it needs to be killed because it is not going to
gracefully exit.
Right... I had it like that initially (a lower timeout - I think 20ms
or so, the patch history on the mailing list would know for sure), but
then simplified it after review feedback to avoid adding another
timeout value.
So it's not at all about any expectation that something should
actually finish to any sort of completion/success. It is primarily
about not logging an error message when there is no error. A thing to
keep in mind is that error messages are a big deal in some cultures.
In addition to that, avoiding needless engine resets is a good thing
as well.
Previously the execlists backend was over-eager and allowed only 1ms
for such contexts to exit. If the context was banned, sure - that
means it was a bad context which was causing many hangs already. But
if the context was a clean one, I argue there is no point in doing an
engine reset.
So if you want, I think it is okay to re-introduce a secondary timeout.
Or if you have an idea on how to avoid the error messages / GPU
resets when "friendly" contexts exit in some other way, that is also
something to discuss.
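(If a secondary timeout does come back, it could be as simple as a
second constant next to the 1ms one - hypothetical names, just to
illustrate the shape:)

	/* Hypothetical: strict timeout for banned contexts vs a more
	 * forgiving one for contexts that are merely being revoked.
	 */
	#define BANNED_PREEMPT_TIMEOUT_MS	1
	#define REVOKED_PREEMPT_TIMEOUT_MS	20

	timeout_ms = intel_context_is_banned(ce) ?
		     BANNED_PREEMPT_TIMEOUT_MS : REVOKED_PREEMPT_TIMEOUT_MS;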
Secondly, the whole persistence thing is a total mess, completely
broken and intended to be massively simplified. See the internal task
for it. In short, the plan is that all contexts will be immediately
killed when the last DRM file handle is closed. Persistence is only
valid between the time the per-context file handle is closed and the
time the master DRM handle is closed. Whereas non-persistent contexts
get killed as soon as the per-context handle is closed. There is
absolutely no connection to heartbeats or other irrelevant operations.
The change we are discussing is not about persistence. But as for
persistence itself - I am not sure it is completely broken, and if, or
when, the internal task will result in anything being attempted. In
the meantime we have had unhappy customers for more than a year. So do
we tell them "please wait a few more years until some internal task
with no clear timeline or anyone assigned maybe gets looked at"?
So in my view, the best option is to revert the ban vs revoke patch.
It is creating bugs. It is making persistence more complex, not
simpler. It harms the user experience.
I am not aware of the bugs, even less so that it is harming the user
experience!?
Are the bugs limited to the GuC backend or are they more general? My
CI runs were clean, so maybe test cases are lacking. Is it just a case
of s/intel_context_is_banned/intel_context_is_schedulable/ in there to
fix it?
Again, the change was not about persistence. It is the opposite -
allowing non-persistent contexts to exit cleanly.
If the original problem was simply that error captures were being
done on Ctrl+C then the fix is simple. Don't capture for a banned
context. There is no need for all the rest of the revoke patch.
Error capture was not part of the original story, so it may be a
completely orthogonal topic that we are discussing in this thread.
Wouldn't it be good then to separate these two issues:
banned/exiting/schedulable handling, and error capturing for an
exiting context?
This patch handles only the latter, and as I understand it there is no
big controversy that we do not need to capture errors for exiting
contexts.
If yes, can we ack/merge this patch to make CI happy and continue the
discussion on the former?
Right, the question is whether the code in guc_handle_context_reset
shouldn't be changed to:

	if (likely(!intel_context_is_exiting(ce))) {
		capture_error_state(guc, ce);
		guc_context_replay(ce);
	} else {

And whether that should be part of a patch which changes a few more
instances of that same check.
But you wrote that doesn't work? And then Daniele said he thinks it is
because revoke is not called when hangcheck is disabled and the GuC
backend gets confused? If I got the conversation right...
I wonder if that means an equivalent of the execlists code:

	if (unlikely(intel_context_is_closed(ce) &&
		     !intel_engine_has_heartbeat(engine)))
		intel_context_set_exiting(ce);

is needed somewhere in the GuC backend. With execlists this makes the
backend skip over the context which is no longer schedulable.
There is nowhere we can put that in the GuC back-end if the context has
already been handed over to the GuC, because at that point it is out of
our hands. We need to tell the GuC if we want the context to be dropped.
But I don't understand why testing did not pick up that miss, or the
miss with guc_context_replay on an exiting context. Or where exactly
to put the extra handling in the GuC backend.
My worry here is that some of the bugs seem to pre-date your patch
(which might be why they weren't flagged in the CI run), so there might
be something else going on that we're missing.
Perhaps it isn't possible, in which case we could have an ugly
solution where for GuC we do something special in kill_engines() if
hangcheck is disabled. Maybe add and call a new helper like:

	static bool intel_context_exit_nohangcheck(struct intel_context *ce)
	{
		bool ret = intel_context_set_exiting(ce);

		/* Escalate to a ban so the GuC actually drops the context */
		if (!ret && intel_engine_uses_guc(ce->engine))
			intel_context_ban(ce, NULL);

		return ret;
	}
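(The call site in kill_engines() might then look roughly like the
sketch below - hypothetical, hand-waving the surrounding logic:)

	/* Hypothetical call site: with no heartbeat, fall back to the
	 * helper so GuC contexts get banned instead of merely marked
	 * as exiting.
	 */
	if (!intel_engine_has_heartbeat(engine))
		intel_context_exit_nohangcheck(ce);
	else
		intel_context_set_exiting(ce);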
Too ugly?
This works for me if it fixes the issues. The no hangcheck case is not
common and the user should be careful of what they're running if they
select it, so IMO we don't need a super pretty or super efficient
solution, just something that works.
Daniele
Regards,
Tvrtko