On 15.01.24 20:30, Joshua Ashton wrote:
On 1/15/24 19:19, Christian König wrote:
On 15.01.24 20:13, Joshua Ashton wrote:
On 1/15/24 18:53, Christian König wrote:
On 15.01.24 19:35, Joshua Ashton wrote:
On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock@xxxxxx> wrote:
Re-sending as plaintext, sorry about that
On 15.01.24 18:54, Michel Dänzer wrote:
> On 2024-01-15 18:26, Friedrich Vock wrote:
>> [snip]
>> The fundamental problem here is that not telling applications that something went wrong when you just canceled their work midway is an out-of-spec hack.
>> When there is a report of real-world apps breaking because of that hack, reports of different apps working (even if it's convenient that they work) don't justify keeping the broken code.
> If the breaking apps hit multiple soft resets in a row, I've laid out a pragmatic solution which covers both cases.
Hitting soft reset every time is the lucky path. Once GPU work is interrupted out of nowhere, all bets are off and it might as well trigger a full system hang next time. No hang recovery should be able to cause that under any circumstance.
I think the more insidious situation is no further hangs but wrong results because we skipped some work. The fact that we skipped work may e.g. result in some texture not being uploaded or some GPGPU work not being done, causing further errors downstream (say, if a game is doing AI/physics on the GPU, not to say anything of actual GPGPU work one might be doing, like AI).
Even worse if this is compute on e.g. OpenCL for something science/math/whatever related, or training a model. You could randomly just get invalid/wrong results without even knowing!
Well, on the kernel side we do provide an API to query the result of a submission. That includes submissions canceled by a soft recovery.
What we just don't do is prevent further submissions from this application, e.g. enforcing that the application is punished for bad behavior.
You do prevent future submissions for regular resets though: those increase karma, which sets ctx->guilty, and if ctx->guilty is set then -ECANCELED is returned for a submission.
ctx->guilty is never true for soft recovery though, as it doesn't increase karma, which is the problem this patch is trying to solve.
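
(For reference, a simplified sketch of that existing path rather than the literal upstream code: a hard reset bumps the offending job's karma, drm_sched_increase_karma() sets the entity's guilty flag once the hang limit is exceeded, and the CS ioctl then refuses anything further from that context, roughly:

    /* Simplified sketch, assuming ctx->guilty is the atomic_t handed to the
     * scheduler entities as their guilty pointer. */
    if (atomic_read(&ctx->guilty))
            return -ECANCELED; /* userspace reports GUILTY_CONTEXT_RESET */

Soft recovery never reaches that path because it never touches karma.)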
By the submission result query API, I assume you mean checking the submission fence error somehow? That doesn't seem very ergonomic for a Vulkan driver compared to the simple solution, which is to just mark it as guilty with what already exists...
Well as I said the guilty handling is broken for quite a number of
reasons.
What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():
    /* Ignore soft recovered fences here */
    r = drm_sched_entity_error(s_entity);
    if (r && r != -ENODATA)
            goto error;
This will bubble up errors from soft recoveries into the entity as well and make sure that further submissions are rejected.
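
Something like this, as a sketch of the idea rather than a tested patch:

    /* Don't special-case soft-recovery errors (-ENODATA) any more; let them
     * propagate like any other entity error so further submissions fail. */
    r = drm_sched_entity_error(s_entity);
    if (r)
            goto error;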
That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.
No, it clearly gets that signaled. We should probably replace the guilty atomic with calls to drm_sched_entity_error().
It's just that this isn't what Marek and I had in mind for this; basically, completely forget about AMDGPU_CTX_OP_QUERY_STATE or AMDGPU_CTX_OP_QUERY_STATE2.
Instead, just look at the return value of the CS or query fence result IOCTL.
When you get an -ENODATA you have been guilty of causing a soft recovery; when you get an -ETIME you are guilty of causing a timeout which had to be hard recovered. When you get an -ECANCELED you are an innocent victim of a hard recovery somebody else caused.
What we haven't defined yet is an error code for losing VRAM, but that should be trivial to do.
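
On the userspace side the mapping could then be something as simple as the following, purely as an illustration for GL_EXT_robustness (hypothetical helper name, not actual Mesa/RADV code):

    #include <errno.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    /* Map the error from the CS / query-fence-result IOCTL onto a
     * GL_EXT_robustness reset status. */
    static GLenum reset_status_from_cs_error(int err)
    {
        switch (err) {
        case 0:
            return GL_NO_ERROR;
        case -ENODATA:   /* we caused a soft recovery */
        case -ETIME:     /* we caused a timeout that needed a hard reset */
            return GL_GUILTY_CONTEXT_RESET_EXT;
        case -ECANCELED: /* innocent victim of somebody else's hard reset */
            return GL_INNOCENT_CONTEXT_RESET_EXT;
        default:         /* e.g. a future "VRAM lost" error code */
            return GL_UNKNOWN_CONTEXT_RESET_EXT;
        }
    }

That keeps the guilty/innocent distinction entirely in the error code, without needing the ctx query IOCTLs.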
Regards,
Christian.
We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.
I am not super sure what is wrong/egregious about the ctx->guilty handling; is there a previous email thread explaining that?
- Joshie 🐸✨
Regards,
Christian.
- Joshie 🐸✨
Now imagine this is VulkanSC displaying something in the car
dashboard, or some medical device doing some compute work to show
something on a graph...
I am not saying you should be doing any of that with RADV +
AMDGPU, but it's just food for thought... :-)
As I have been saying, you simply cannot just violate API contracts like this; it's flat-out wrong.
Yeah, completely agree to that.
Regards,
Christian.
- Joshie 🐸✨
>
>
>> If mutter needs to be robust against faults it caused itself, it should be robust against GPU resets.
> It's unlikely that the hangs I've seen were caused by mutter itself, more likely Mesa or amdgpu.
>
> Anyway, this will happen at some point, the reality is it hasn't yet though.
>
>
- Joshie 🐸✨