Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 15.01.24 18:09, Michel Dänzer wrote:
On 2024-01-15 17:46, Joshua Ashton wrote:
On 1/15/24 16:34, Michel Dänzer wrote:
On 2024-01-15 17:19, Friedrich Vock wrote:
On 15.01.24 16:43, Joshua Ashton wrote:
On 1/15/24 15:25, Michel Dänzer wrote:
On 2024-01-15 14:17, Christian König wrote:
Am 15.01.24 um 12:37 schrieb Joshua Ashton:
On 1/15/24 09:40, Christian König wrote:
Am 13.01.24 um 15:02 schrieb Joshua Ashton:

Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.
Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.
A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).
No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just chose to break these contracts because you think it
'might' be a better user experience.
Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)
That's indeed what happened for me, multiple times. And each time the session continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft resets. It's almost as if different people may have different experiences! ;)
Your anecdote of whatever application coincidentally managing to soldier through being hung is really not relevant.
But yours is, got it.
The fundamental problem here is that not telling applications that
something went wrong when you just canceled their work midway is an
out-of-spec hack.
When there is a report of real-world apps breaking because of that hack,
reports of different apps working (even if it's convenient that they
work) doesn't justify keeping the broken code.

I don't think it makes sense to argue against fixing broken code just
because it doesn't affect all apps (and is convenient for some of them).

This isn't anything personal. I don't think your experience is invalid
(I think I just misunderstood the original thread. Sorry if I came
across as condescending!), all I'm arguing is that this should be fixed
somewhere else than the kernel, because what the kernel does right now
makes it impossible to implement modern APIs fully correctly. If mutter
needs to be robust against faults it caused itself, it should be robust
against GPU resets.



It looks like Mutter already has some handling for GL robustness anyway...
That's just for suspend/resume with the nvidia driver. It can't recover from GPU hangs yet.






[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux