Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 1/15/24 18:53, Christian König wrote:
Am 15.01.24 um 19:35 schrieb Joshua Ashton:
On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock@xxxxxx <mailto:friedrich.vock@xxxxxx>> wrote:

    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
     > On 2024-01-15 18:26, Friedrich Vock wrote:
     >> [snip]
     >> The fundamental problem here is that not telling applications that      >> something went wrong when you just canceled their work midway is an
     >> out-of-spec hack.
     >> When there is a report of real-world apps breaking because of
    that hack,
     >> reports of different apps working (even if it's convenient that they
     >> work) doesn't justify keeping the broken code.
     > If the breaking apps hit multiple soft resets in a row, I've laid
    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should be able to
    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong results because we skipped some work. That we skipped work may e.g. result in some texture not being uploaded or some GPGPU work not being done and causing further errors downstream (say if a game is doing AI/physics on the GPU not to say anything of actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even knowing!

Well on the kernel side we do provide an API to query the result of a submission. That includes canceling submissions with a soft recovery.

What we just doesn't do is to prevent further submissions from this application. E.g. enforcing that the application is punished for bad behavior.

You do prevent future submissions for regular resets though: Those increase karma which sets ctx->guilty, and if ctx->guilty then -ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't increase karma, which is the problem this patch is trying to solve.

By the submission result query API, I you assume you mean checking the submission fence error somehow? That doesn't seem very ergonomic for a Vulkan driver compared to the simple solution which is to just mark it as guilty with what already exists...

- Joshie 🐸✨



Now imagine this is VulkanSC displaying something in the car dashboard, or some medical device doing some compute work to show something on a graph...

I am not saying you should be doing any of that with RADV + AMDGPU, but it's just food for thought... :-)

As I have been saying, you simply cannot just violate API contracts like this, it's flatout wrong.

Yeah, completely agree to that.

Regards,
Christian.


- Joshie 🐸✨


     >
     >
     >> If mutter needs to be robust against faults it caused itself, it
    should be robust
     >> against GPU resets.
     > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
     >
     > Anyway, this will happen at some point, the reality is it hasn't
    yet though.
     >
     >



- Joshie 🐸✨



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux